Preface


Theoretical  computer  science  treats  any  computational  subject  for  which  a  good  model  can  be 
created.  Research  on  formal  models  of  computation  was  initiated  in  the  1930s  and  1940s  by 
Turing,  Post,  Kleene,  Church,  and  others.  In  the  1950s  and  1960s  programming  languages, 
language  translators,  and  operating  systems  were  under  development  and  therefore  became 
both  the  subject  and  basis  for  a  great  deal  of  theoretical  work.  The  power  of  computers  of 
this  period  was  limited  by  slow  processors  and  small  amounts  of  memory,  and  thus  theories 
(models,  algorithms,  and  analysis)  were  developed  to  explore  the  efficient  use  of  computers  as 
well  as  the  inherent  complexity  of  problems.  The  former  subject  is  known  today  as  algorithms 
and  data  structures,  the  latter  computational  complexity. 

The  focus  of  theoretical  computer  scientists  in  the  1960s  on  languages  is  reflected  in  the 
first  textbook  on  the  subject,  Formal  Languages  and  Their  Relation  to  Automata  by  John 
Hopcroft and Jeffrey Ullman. This influential book led to the creation of many
language-centered theoretical computer science courses; many introductory theory courses today
continue to reflect the content of this book and the interests of theoreticians of the 1960s and
early 1970s.

Although  the  1970s  and  1980s  saw  the  development  of  models  and  methods  of  analysis 
directed  at  understanding  the  limits  on  the  performance  of  computers,  this  attractive  new 
material  has  not  been  made  available  at  the  introductory  level.  This  book  is  designed  to  remedy 
this  situation. 

This  book  is  distinguished  from  others  on  theoretical  computer  science  by  its  primary  focus 
on  real  problems,  its  emphasis  on  concrete  models  of  machines  and  programming  styles,  and 
the  number  and  variety  of  models  and  styles  it  covers.  These  include  the  logic  circuit,  the  finite- 
state  machine,  the  pushdown  automaton,  the  random-access  machine,  memory  hierarchies, 
the  PRAM  (parallel  random-access  machine),  the  VLSI  (very  large-scale  integrated)  chip,  and 
a variety of parallel machines.

The book covers the traditional topics of formal languages and automata and complexity
classes but also gives an introduction to the more modern topics of space-time tradeoffs,
memory hierarchies, parallel computation, the VLSI model, and circuit complexity. These modern
topics are integrated throughout the text, as illustrated by the early introduction of P-complete
and NP-complete problems. The book provides the first textbook treatment of space-time
tradeoffs and memory hierarchies as well as a comprehensive introduction to traditional
computational complexity. Its treatment of circuit complexity is modern and substantive, and
parallelism is integrated throughout.


Plan  of  the  Book 

The  book  has  three  parts.  Part  I  (Chapter  1)  is  an  overview.  Part  II,  consisting  of  Chapters  2-7, 
provides  an  introduction  to  general  computational  models.  Chapter  2  introduces  logic  circuits 
and  derives  upper  bounds  on  the  size  and  depth  of  circuits  for  important  problems.  The  finite- 
state,  random-access,  and  Turing  machine  models  are  defined  in  Chapter  3  and  circuits  are 
presented  that  simulate  computations  performed  by  these  machines.  From  such  simulations 
arise results of two kinds. First, computational inequalities of the form $C(f) \le \kappa ST$ are
derived for problems $f$ run on the random-access machine, where $C(f)$ is the size of the
smallest circuit for $f$, $\kappa$ is a constant, and $S$ and $T$ are storage space and computation time.
If $ST$ is too small relative to $C(f)$, the problem $f$ cannot be solved. Second, the same circuit
simulations  are  interpreted  to  identify  P-complete  and  NP-complete  problems.  P-complete 
problems  can  all  be  solved  in  polynomial  time  but  are  believed  hard  to  solve  fast  on  parallel 
machines.  The  NP-complete  problems  include  many  important  scheduling  and  optimization 
problems  and  are  believed  not  solvable  in  polynomial  time  on  serial  machines. 

Part  II  also  contains  traditional  material  on  formal  languages  and  automata.  Chapter  4 
explores the connection between two machine models (the finite-state machine and the
pushdown automaton) and language types in the Chomsky hierarchy. Chapter 5 examines Turing
machines. It shows that the languages recognized by them are the phrase-structure languages,
the most expressive of the language types in the Chomsky hierarchy. This chapter also
examines universal Turing machines, reducibility, unsolvable problems, and the functions computed
by  Turing  machines. 

Finally, Part II contains Chapters 6 and 7, which introduce algebraic and combinatorial
circuits and parallel machine models, respectively. Algebraic and combinatorial circuits are
graphs of straight-line programs of the kind typically used for matrix multiplication and
inversion, solving linear systems of equations, computing the fast Fourier transform, performing
convolutions, and merging and sorting. Chapter 6 contains reference material on problems
used in later chapters to illustrate models and lower-bound arguments. Parallel machine
models such as the PRAM and networks of computers organized as meshes and hypercubes are
studied  in  Chapter  7.  A  framework  is  given  for  the  design  of  algorithms  and  derivation  of 
lower  bounds  on  performance. 

Part III, a comprehensive treatment of computational complexity, consists of Chapters 8-12.
Chapter 8 provides a comprehensive survey of traditional computational complexity. Using
serial  and  parallel  machine  models,  it  examines  time-  and  space-bounded  complexity  classes, 
including  the  P-complete,  NP-complete  and  PSPACE-complete  languages  as  well  as  the  circuit 
complexity classes NC and P/poly. This chapter also establishes the connections between
deterministic and nondeterministic space complexity classes and shows that the nondeterministic
space  classes  are  closed  under  complements. 

Circuit  complexity  is  the  topic  of  Chapter  9.  Methods  for  deriving  lower  bounds  on  circuit 
size  and  depth  are  given  for  general  circuits,  formulas,  monotone  circuits,  and  bounded-depth 
circuits.  This  modern  treatment  of  circuit  complexity  complements  Chapter  2,  which  derives 
tight  upper  bounds  on  circuit  size  and  depth. 

Space-time  tradeoffs  are  studied  in  Chapter  10  using  two  computational  models,  the 
branching  program  and  the  pebble  game,  which  capture  the  notions  of  space  and  time  for 
many  programs  for  which  branching  is  and  is  not  allowed,  respectively.  Methods  for  deriving 
lower  bounds  on  the  exchange  of  space  for  time  are  presented  and  applied  to  a  representative 
set  of  problems. 

Chapter 11 examines models for memory hierarchy systems. It uses the pebble game with
pebbles  of  multiple  colors  to  designate  storage  locations  at  different  levels  of  a  hierarchy,  and 
also  employs  block  and  RAM-based  models.  Again,  lower  bounds  on  performance  are  derived 
and  compared  with  the  performance  of  algorithms.  This  chapter  also  has  a  brief  treatment  of 
the LRU and FIFO memory-management algorithms that uses competitive analysis to
compare their performance to that of the optimal algorithm.

The  book  closes  with  Chapter  12  on  the  VLSI  model  for  integrated  circuits.  In  this  model 
both  chip  area  A  and  time  T  are  important,  and  methods  are  given  for  deriving  lower  bounds 
on measures such as $AT^2$. Chip layouts and VLSI algorithms are also exhibited whose
performance comes close to matching the lower bounds.


Use  of  the  Book 

Many  different  courses  can  be  designed  around  this  book.  A  core  undergraduate  computer 
science  course  can  be  taught  using  Parts  I  and  II  and  some  material  from  Chapter  8.  The 
first  course  on  theoretical  computer  science  for  majors  at  Brown  uses  most  of  Chapters  1-5 
except  for  the  advanced  material  in  Chapters  2  and  3.  It  uses  a  few  elementary  sections  from 
Chapters 10 and 11 to emphasize space-time tradeoffs, which play a central role in Chapter 3
and  lead  into  the  study  of  formal  languages  and  automata  in  Chapter  4.  After  covering  the 
material  of  Chapter  5,  a  few  lectures  are  given  on  NP-complete  problems  from  Chapter  8. 

This  introductory  course  has  four  programming  assignments  in  Scheme  that  illustrate  the 
ideas  embodied  in  Chapters  2,  3  and  5.  The  first  program  solves  the  circuit-value  problem, 
that  is,  it  executes  a  straight-line  program,  thereby  producing  the  outputs  defined  by  this 
program.  The  second  program  writes  a  straight-line  program  simulating  T  steps  by  a  finite- 
state  machine.  The  third  program  writes  a  straight-line  program  simulating  T  steps  by  a 
one-tape  Turing  machine  (this  is  the  reduction  involved  in  the  Cook-Levin  theorem)  and  the 
fourth  one  simulates  a  universal  Turing  machine. 
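
The first assignment can be pictured with a small Python sketch (the course itself uses Scheme). The program format, the gate names in `OPS`, and the function `run_straight_line` are assumptions made for this illustration and are not taken from the course: each step names an output, an operation, and the previously defined values it reads, and executing the steps in order solves the circuit-value problem for that program.

```python
from typing import Dict, List, Tuple

# Hypothetical step format: (output name, operation, names of operands).
Step = Tuple[str, str, Tuple[str, ...]]

# Boolean operations on the values 0 and 1 (an assumed basis).
OPS = {
    "AND": lambda a, b: a & b,
    "OR": lambda a, b: a | b,
    "XOR": lambda a, b: a ^ b,
    "NOT": lambda a: 1 - a,
}


def run_straight_line(program: List[Step], inputs: Dict[str, int]) -> Dict[str, int]:
    """Execute the steps in order; each step may read only earlier values."""
    values = dict(inputs)
    for out, op, args in program:
        values[out] = OPS[op](*(values[name] for name in args))
    return values


# A half adder: s is the sum bit and c the carry bit of x + y.
# The last step reads an earlier result rather than an input.
HALF_ADDER: List[Step] = [
    ("s", "XOR", ("x", "y")),
    ("c", "AND", ("x", "y")),
    ("no_carry", "NOT", ("c",)),
]

result = run_straight_line(HALF_ADDER, {"x": 1, "y": 1})
```

Because a straight-line program has no loops or branches, one pass over its steps is enough to produce every output.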

Several  different  advanced  courses  can  be  assembled  from  the  material  of  Part  III  and 
introductory  material  of  Part  II.  For  example,  a  course  on  concrete  computational  complexity 
can be assembled around Chapters 10 and 11, which examine tradeoffs between space and
time  in  primary  and  secondary  memory.  This  course  would  presume  or  include  introductory 
material  from  Chapter  3. 

An advanced course emphasizing traditional computational complexity can be based
primarily on computability (Chapter 5) and complexity classes (Chapter 8) and some material on
circuit complexity from Chapter 9.

An advanced course on circuit complexity can be assembled from Chapter 2 on logic
circuits and Chapter 9 on circuit complexity. The former describes efficient circuits for a variety
of functions while the latter surveys methods for deriving lower bounds on circuit complexity.

The titles of sections containing advanced material carry an asterisk.

Exploring the Power of Computing


Part  I 

OVERVIEW  OF  THE  BOOK 


CHAPTER 1


The Role of Theory in Computer Science


Computer  science  is  the  study  of  computers  and  programs,  the  collections  of  instructions  that 
direct  the  activity  of  computers.  Although  computers  are  made  of  simple  elements,  the  tasks 
they  perform  are  often  very  complex.  The  great  disparity  between  the  simplicity  of  computers 
and  the  complexity  of  computational  tasks  offers  intellectual  challenges  of  the  highest  order.  It 
is  the  models  and  methods  of  analysis  developed  by  computer  science  to  meet  these  challenges 
that  are  the  subject  of  theoretical  computer  science. 

Computer  scientists  have  developed  models  for  machines,  such  as  the  random-access  and 
Turing  machines;  for  languages,  such  as  regular  and  context-free  languages;  for  programs,  such 
as  straight-line  and  branching  programs;  and  for  systems  of  programs,  such  as  compilers  and 
operating  systems.  Models  have  also  been  developed  for  data  structures,  such  as  heaps,  and  for 
databases,  such  as  the  relational  and  object-oriented  databases. 

Methods  of  analysis  have  been  developed  to  study  the  efficiency  of  algorithms  and  their 
data  structures,  the  expressibility  of  languages  and  the  capacity  of  computer  architectures  to 
recognize  them,  the  classification  of  problems  by  the  time  and  space  required  to  solve  them, 
their  inherent  complexity,  and  limits  that  hold  simultaneously  on  computational  resources  for 
particular  problems.  This  book  examines  each  of  these  topics  in  detail  except  for  the  first, 
analysis  of  algorithms  and  data  structures,  which  it  covers  only  briefly. 

This  chapter  provides  an  overview  of  the  book.  Except  for  the  mathematical  preliminaries, 
the topics introduced here are revisited later.

1.1 A Brief History of Theoretical Computer Science

Theoretical  computer  science  uses  models  and  analysis  to  study  computers  and  computation. 
It  thus  encompasses  the  many  areas  of  computer  science  sufficiently  well  developed  to  have 
models  and  methods  of  analysis.  This  includes  most  areas  of  the  field. 

1.1.1  Early  Years 

TURING  AND  CHURCH:  Theoretical  computer  science  emerged  primarily  from  the  work  of 
Alan  Turing  and  Alonzo  Church  in  1936,  although  many  others,  such  as  Russell,  Hilbert,  and 
Boole, were important precursors. Turing and Church introduced formal computational
models (the Turing machine and lambda calculus), showed that some well-stated computational
problems  have  no  solution,  and  demonstrated  the  existence  of  universal  computing  machines, 
machines  capable  of  simulating  every  other  machine  of  their  type. 

Turing  and  Church  were  logicians;  their  work  reflected  the  concerns  of  mathematical  logic. 
The  origins  of  computers  predate  them  by  centuries,  going  back  at  least  as  far  as  the  abacus,  if 
we  call  any  mechanical  aid  to  computation  a  computer.  A  very  important  contribution  to  the 
study  of  computers  was  made  by  Charles  Babbage,  who  in  1836  completed  the  design  of  his 
first  programmable  Analytical  Engine,  a  mechanical  computer  capable  of  arithmetic  operations 
under  the  control  of  a  sequence  of  punched  cards  (an  idea  borrowed  from  the  Jacquard  loom). 
A  notable  development  in  the  history  of  computers,  but  one  of  less  significance,  was  the  1938 
demonstration  by  Claude  Shannon  that  Boolean  algebra  could  be  used  to  explain  the  operation 
of  relay  circuits,  a  form  of  electromechanical  computer.  He  was  later  to  develop  his  profound 
“mathematical  theory  of  communication”  in  1948  as  well  as  to  lay  the  foundations  for  the 
study  of  circuit  complexity  in  1949. 

FIRST COMPUTERS: In 1941 Konrad Zuse built the Z3, the first general-purpose
program-controlled computer, a machine constructed from electromagnetic relays. The Z3 read
programs from a punched paper tape. In the mid-1940s the first programmable electronic
computer (using vacuum tubes), the ENIAC, was developed by Eckert and Mauchly. Von
Neumann, in a very influential paper, codified the model that now carries his name. With the
invention  of  the  transistor  in  1947,  electronic  computers  were  to  become  much  smaller  and 
more  powerful  than  the  30-ton  ENIAC.  The  microminiaturization  of  transistors  continues 
today  to  produce  computers  of  ever-increasing  computing  power  in  ever-shrinking  packages. 

EARLY  LANGUAGE  DEVELOPMENT:  The  first  computers  were  very  difficult  to  program  (cables 
were  plugged  and  unplugged  on  the  ENIAC).  Later,  programmers  supplied  commands  by 
typing in sequences of 0’s and 1’s, the machine language of computers. A major contribution
of the 1950s was the development of programming languages, such as FORTRAN, COBOL,
and LISP. These languages allowed programmers to specify commands in mnemonic code and
with high-level constructs such as loops, arrays, and procedures.

As  languages  were  developed,  it  became  important  to  understand  their  expressiveness  as 
well  as  the  characteristics  of  the  simplest  computers  that  could  translate  them  into  machine 
language.  As  a  consequence,  formal  languages  and  the  automata  that  recognize  them  became 
an  important  topic  of  study  in  the  1950s.  Nondeterministic  models  -  models  that  may  have 
more  than  one  possible  next  state  for  the  current  state  and  input  -  were  introduced  during  this 
time as a way to classify languages.

1.1.2 1950s

FINITE-STATE  MACHINES:  Occurring  in  parallel  with  the  development  of  languages  was  the 
development  of  models  for  computers.  The  1950s  also  saw  the  formalization  of  the  finite-state 
machine  (also  called  the  sequential  machine),  the  sequential  circuit  (the  concrete  realization  of 
a  sequential  machine),  and  the  pushdown  automaton.  Rabin  and  Scott  pioneered  the  use  of 
analytical  tools  to  study  the  capabilities  and  limitations  of  these  models. 


FORMAL LANGUAGES: The late 1950s and 1960s saw an explosion of research on formal
languages. By 1964 the Chomsky language hierarchy, consisting of the regular, context-free,
context-sensitive, and recursively enumerable languages, was established, as was the
correspondence between these languages and the memory organizations of machine types recognizing
them, namely the finite-state machine, the pushdown automaton, the linear-bounded
automaton, and the Turing machine. Many variants of these standard grammars, languages,
and  machines  were  also  examined. 


1.1.3  1960s 

COMPUTATIONAL COMPLEXITY: The 1960s also saw the laying of the foundation for
computational complexity with the classification, by Hartmanis, Lewis, and Stearns and others,
of languages and functions according to the time and space needed to compute them.
Hierarchies of problems were identified and speed-up and gap theorems established. This area
was to flower and lead to many important discoveries, including that by Cook (and
independently Levin) of NP-complete languages, languages associated with many hard
combinatorial and optimization problems, including the Traveling Salesperson problem, the
problem of determining the shortest tour of cities for which all intercity distances are given.
Karp was instrumental in demonstrating the importance of NP-complete languages. Because
problems whose running time is exponential are considered intractable, it is very important to
know whether strings in NP-complete languages can be recognized in time polynomial in their
length. This is called the $P \stackrel{?}{=} NP$ problem, where P is the class of deterministic
polynomial-time languages. The P-complete languages were also identified in the 1970s; these
are the hardest languages in P to recognize on parallel machines.

1.1.4  1970s 

COMPUTATION TIME AND CIRCUIT COMPLEXITY: In the early 1970s the connection between
computation time on Turing machines and circuit complexity was established, thereby giving
the study of circuits renewed importance and offering the hope that the $P \stackrel{?}{=} NP$ problem
could be resolved via circuit complexity.

PROGRAMMING  LANGUAGE  SEMANTICS:  The  1970s  were  a  very  productive  period  for  formal 
methods in the study of programs and languages. The area of programming language
semantics was very active; models and denotations were developed to give meaning to the phrase
“programming  language,”  thereby  putting  language  development  on  a  solid  footing.  Formal 
methods  for  ensuring  the  correctness  of  programs  were  also  developed  and  applied  to  program 
development. The 1970s also saw the emergence of the relational database model and the
development of the relational calculus as a means for the efficient reformulation of database
queries.

SPACE-TIME TRADEOFFS: An important byproduct of the work on formal languages and
semantics in the 1970s is the pebble game. In this game, played on a directed acyclic graph,
pebbles  are  placed  on  vertices  to  indicate  that  the  value  associated  with  a  vertex  is  located  in 
the  register  of  a  central  processing  unit.  The  game  allows  the  study  of  tradeoffs  between  the 
number  of  pebbles  (or  registers)  and  time  (the  number  of  pebble  placements)  and  leads  to 
space-time  product  inequalities  for  individual  problems.  These  ideas  were  generalized  in  the 
1980s  to  branching  programs. 

VLSI  MODEL:  When  the  very  large-scale  integration  (VLSI)  of  electronic  components  onto 
semiconductor chips emerged in the 1970s, VLSI models for them were introduced and
analyzed. Ideas from the study of pebble games were applied and led to tradeoff inequalities
relating the complexity of a problem to products such as $AT^2$, where $A$ is the area of a chip
and  T  is  the  number  of  steps  it  takes  to  solve  a  problem.  In  the  late  1970s  and  1980s  the 
layout  of  computers  on  VLSI  chips  also  became  an  important  research  topic. 

ALGORITHMS AND DATA STRUCTURES: While algorithms (models for programs) and data
structures were introduced from the beginning of the field, they experienced a flowering in the
1970s and 1980s. Knuth was most influential in this development, as later were Aho, Hopcroft,
and  Ullman.  New  algorithms  were  invented  for  sorting,  data  storage  and  retrieval,  problems  on 
graphs,  polynomial  evaluation,  solving  linear  systems  of  equations,  computational  geometry, 
and  many  other  topics  on  both  serial  and  parallel  machines. 

1.1.5  1980s  and  1990s 

PARALLEL  COMPUTING  AND  I/O  COMPLEXITY:  The  1980s  also  saw  the  emergence  of  many 
other theoretical computer science research topics, including parallel and distributed
computing, cryptography, and I/O complexity. A variety of concrete and abstract models of parallel
computers were developed, ranging from VLSI-based models to the parallel random-access
machine (PRAM), a collection of synchronous processors alternately reading from and
writing to a common array of memory cells and computing locally. Parallel algorithms and data
structures  were  developed,  as  were  classifications  of  problems  according  to  the  extent  to  which 
they  are  parallelizable.  I/O  complexity,  the  study  of  data  movement  among  memory  units 
in  a  memory  hierarchy,  emerged  around  1980.  Memory  hierarchies  take  advantage  of  the 
temporal  and  spatial  locality  of  problems  to  simulate  fast,  expensive  memories  with  slow  and 
inexpensive  ones. 

DISTRIBUTED  COMPUTING:  The  emergence  of  networks  of  computers  brought  to  light  some 
hard  logical  problems  that  led  to  a  theory  of  distributed  computing,  that  is,  computing  with 
multiple and potentially asynchronous processors that may be widely dispersed. The
problems addressed in this area include reaching consensus in the presence of malicious adversaries,
handling processor failures, and efficiently coordinating the activities of agents when
interprocessor latencies are large. Although some of the problems addressed in distributed computing
were  first  introduced  in  the  1950s,  this  topic  is  associated  with  the  1980s  and  1990s. 


©John  E  Savage 


CRYPTOGRAPHY: While cryptography has been important for ages, it became a serious
concern of complexity theorists in the late 1970s and an active research area in the 1980s and
1990s. Some of the important cryptographic issues are a) how to exchange information
secretly without having to exchange a private key with each communicating agent, b) how to
identify  with  high  assurance  the  sender  of  a  message,  and  c)  how  to  convince  another  agent 
that  you  have  the  solution  to  a  problem  without  transferring  the  solution  to  him  or  her. 

As this brief history illustrates, theoretical computer science speaks to many different
computational issues. As the range of issues addressed by computer science grows in sophistication,
we  can  expect  a  commensurate  growth  in  the  richness  of  theoretical  computer  science. 

1.2  Mathematical  Preliminaries 

In  this  section  we  introduce  basic  concepts  used  throughout  the  book.  Since  it  is  presumed 
that  the  reader  has  already  met  most  of  this  material,  this  presentation  is  abbreviated. 

1.2.1  Sets 

A set $A$ is a non-repeating and unordered collection of elements. For example,
$L_{50s} = \{\text{Cobol}, \text{Fortran}, \text{Lisp}\}$ is a set of elements that could be interpreted as the names of
languages designed in the 1950s. Because the elements in a set are unordered, {Cobol, Fortran, Lisp}
and {Lisp, Cobol, Fortran} denote the same set. It is very convenient to recognize the empty
set $\emptyset$, a set that does not have any elements. The set $\mathcal{B} = \{0, 1\}$ containing 0 and 1 is used
throughout this book.

The notation $a \in A$ means that element $a$ is contained in set $A$. For example,
Cobol $\in L_{50s}$ means that Cobol is a language invented in the 1950s. A set can be finite or infinite.
The cardinality of a finite set $A$, denoted $|A|$, is the number of elements in $A$. We say that a
set $A$ is a subset of a set $B$, denoted $A \subseteq B$, if every element of $A$ is an element of $B$. If
$A \subseteq B$ but $B$ contains elements not in $A$, we say that $A$ is a proper subset and write $A \subset B$.

The union of two sets $A$ and $B$, denoted $A \cup B$, is the set containing elements that
are in $A$, $B$ or both. For example, if $A_0 = \{1, 2, 3\}$ and $B_0 = \{4, 3, 5\}$, then
$A_0 \cup B_0 = \{5, 4, 3, 1, 2\}$. The intersection of sets $A$ and $B$, denoted $A \cap B$, is the set
containing elements that are in both $A$ and $B$. Hence, $A_0 \cap B_0 = \{3\}$. If $A$ and $B$ have no
elements in common, denoted $A \cap B = \emptyset$, they are said to be disjoint sets. The difference
between sets $A$ and $B$, denoted $A - B$, is the set containing the elements that are in $A$ but
not in $B$. Thus, $A_0 - B_0 = \{1, 2\}$. (See Fig. 1.1.)


Figure 1.1 A Venn diagram showing the intersection and difference of sets $A$ and $B$. Their
union is the set of elements in $A$, in $B$, or in both.

The following simple properties hold for arbitrary sets $A$ and $B$ and the operations of set
union, intersection, and difference:


$A \cup B = B \cup A$
$A \cap B = B \cap A$
$A \cup \emptyset = A$
$A \cap \emptyset = \emptyset$
$A - \emptyset = A$
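
The set operations and identities above can be checked on the running example with Python's built-in `set` type. This is only an illustrative sketch: the names `A0`, `B0`, `EMPTY`, and `identities_hold` are chosen here, and checking one pair of sets illustrates the identities without proving them.

```python
# The example sets from the text.
A0 = {1, 2, 3}
B0 = {4, 3, 5}
EMPTY = set()  # the empty set

union = A0 | B0          # elements in A0, B0, or both
intersection = A0 & B0   # elements in both A0 and B0
difference = A0 - B0     # elements in A0 but not in B0


def identities_hold(a: set, b: set) -> bool:
    """Check the five listed identities for one particular pair of sets."""
    return (
        a | b == b | a
        and a & b == b & a
        and a | EMPTY == a
        and a & EMPTY == EMPTY
        and a - EMPTY == a
    )
```

Because sets are unordered, `{5, 4, 3, 1, 2}` and `{1, 2, 3, 4, 5}` compare equal, matching the remark that the order in which elements are listed does not matter.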

The power set of a set $A$, denoted $2^A$, is the set of all subsets of $A$ including the empty
set. For example, $2^{\{2,5,9\}} = \{\emptyset, \{2\}, \{5\}, \{9\}, \{2,5\}, \{2,9\}, \{5,9\}, \{2,5,9\}\}$. We use $2^A$
to denote the power set of $A$ as a reminder that it has $2^{|A|}$ elements. To see this, observe that
for each subset $B$ of the set $A$ there is a binary $n$-tuple $(e_1, e_2, \ldots, e_{|A|})$, with $n = |A|$, where
$e_i$ is 1 if the $i$th element of $A$ is in $B$ and 0 otherwise. Since there are $2^{|A|}$ ways to assign
0’s and 1’s to $(e_1, e_2, \ldots, e_{|A|})$, $2^A$ has $2^{|A|}$ elements.
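
The counting argument can be mirrored directly in code. The sketch below (the function name `power_set` is an assumption for illustration) enumerates every binary tuple of length $|A|$ and, for each tuple, keeps the elements whose bit is 1; each tuple yields exactly one subset.

```python
from itertools import product
from typing import FrozenSet, Iterable, List


def power_set(elements: Iterable[int]) -> List[FrozenSet[int]]:
    """Build all subsets by enumerating binary tuples (e_1, ..., e_|A|)."""
    items = list(elements)
    subsets = []
    for bits in product((0, 1), repeat=len(items)):
        # Keep the i-th element exactly when e_i is 1.
        subsets.append(frozenset(x for x, e in zip(items, bits) if e == 1))
    return subsets


subsets_of_259 = power_set([2, 5, 9])
```

`frozenset` is used because ordinary Python sets are not hashable and so cannot themselves be members of a set.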

The Cartesian product of two sets $A$ and $B$, denoted $A \times B$, is another set, the set of pairs
$\{(a, b) \mid a \in A, b \in B\}$. For example, when $A_0 = \{1, 2, 3\}$ and $B_0 = \{4, 3, 5\}$,
$A_0 \times B_0 = \{(1,4), (1,3), (1,5), (2,4), (2,3), (2,5), (3,4), (3,3), (3,5)\}$. The Cartesian product of $k$
sets $A_1, A_2, \ldots, A_k$, denoted $A_1 \times A_2 \times \cdots \times A_k$, is the set of $k$-tuples
$\{(a_1, a_2, \ldots, a_k) \mid a_1 \in A_1, a_2 \in A_2, \ldots, a_k \in A_k\}$ whose components are drawn from the
respective sets. If for each $1 \le i \le k$, $A_i = A$, the $k$-fold Cartesian product
$A_1 \times A_2 \times \cdots \times A_k$ is denoted $A^k$. An element of $A^k$ is a $k$-tuple $(a_1, a_2, \ldots, a_k)$ where
$a_i \in A$. Thus, the binary $n$-tuple $(e_1, e_2, \ldots, e_{|A|})$ of the preceding paragraph is an element
of $\{0, 1\}^n$.
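
As a small illustration, Python's standard `itertools.product` computes Cartesian products of finite collections. The names `A0`, `B0`, `pairs`, and `k_fold` are chosen for this sketch; lists are used instead of sets only so that the pairs come out in the same order as in the example.

```python
from itertools import product
from typing import List, Sequence, Tuple

A0 = [1, 2, 3]
B0 = [4, 3, 5]

# All pairs (a, b) with a drawn from A0 and b drawn from B0.
pairs = list(product(A0, B0))


def k_fold(a: Sequence[int], k: int) -> List[Tuple[int, ...]]:
    """Return the k-fold Cartesian product of a with itself as a list of k-tuples."""
    return list(product(a, repeat=k))


binary_triples = k_fold([0, 1], 3)
```

The product of sets of sizes $m$ and $n$ has $mn$ pairs, and the $k$-fold product of a set of size $m$ has $m^k$ tuples, which is why `binary_triples` has eight entries.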

1.2.2  Number  Systems 

Integers are widely used to describe problems. The infinite set $\mathbb{N}$ consisting of 0 and the
positive integers $\{1, 2, 3, \ldots\}$ is called the set of natural numbers. The set of positive and
negative integers and zero, $\mathbb{Z}$, consists of the integers $\{0, 1, -1, 2, -2, \ldots\}$.

In the standard decimal representation of the natural numbers, each integer $n$ is
represented as a sum of powers of 10. For example, $867 = 8 \times 10^2 + 6 \times 10^1 + 7 \times 10^0$. Since
computers today are binary machines, it is convenient to represent integers over base 2 instead
of 10. The standard binary representation for the natural numbers represents each integer as
a sum of powers of 2. That is, for some $k > 0$ each integer $n$ can be represented as a $k$-tuple
$x = (x_{k-1}, x_{k-2}, \ldots, x_1, x_0)$, where each of $x_{k-1}, x_{k-2}, \ldots, x_1, x_0$ has value 0 or 1 and $n$
satisfies the following identity:


$n = x_{k-1} 2^{k-1} + x_{k-2} 2^{k-2} + \cdots + x_1 2^1 + x_0 2^0$

The largest integer that can be represented with $k$ bits is $2^{k-1} + 2^{k-2} + \cdots + 2^1 + 2^0 = 2^k - 1$.
(See Problem 1.1.) Also, the $k$-tuple representation for $n$ is unique; that is, two
different integers cannot have the same representation. When leading 0’s are suppressed, the
standard binary representations for 1, 15, 32, and 97 are (1), (1, 1, 1, 1), (1, 0, 0, 0, 0, 0), and
(1, 1, 0, 0, 0, 0, 1), respectively.
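
The conversion between a natural number and its standard binary representation can be sketched as follows. The function names `to_binary` and `from_binary` are assumptions for this illustration; the first uses repeated division by 2 and the second evaluates the sum of powers of 2 from the most significant bit down.

```python
from typing import Tuple


def to_binary(n: int) -> Tuple[int, ...]:
    """Return (x_{k-1}, ..., x_1, x_0) for a natural number n, without leading 0's."""
    if n < 0:
        raise ValueError("only natural numbers have this representation")
    bits = []
    while n > 0:
        bits.append(n % 2)  # the remainder is the next least significant bit
        n //= 2
    bits.reverse()          # most significant bit first
    return tuple(bits) if bits else (0,)


def from_binary(bits: Tuple[int, ...]) -> int:
    """Evaluate x_{k-1} 2^{k-1} + ... + x_1 2^1 + x_0 2^0."""
    n = 0
    for b in bits:
        n = 2 * n + b
    return n
```

Applying `from_binary` to a tuple of $k$ ones gives $2^k - 1$, the largest integer representable with $k$ bits.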

We denote with $x + y$, $x - y$, $x * y$, and $x/y$ the results of addition, subtraction,
multiplication, and division of integers.

1.2.3 Languages and Strings

An alphabet $A$ is a finite set with at least two elements. A string $x$ is an element
$(a_1, a_2, \ldots, a_k)$ of the Cartesian product $A^k$ in which we drop the commas and parentheses.
Thus, we write $x = a_1 a_2 \cdots a_k$, and say that $x$ is a string over the alphabet $A$. A string $x$ in
$A^k$ is said to have length $k$, denoted $|x| = k$. Thus, 011 is a string of length three over $A = \{0, 1\}$.

Consider now the Cartesian product $A^k \times A^l = A^{k+l}$, which is the $(k + l)$-fold Cartesian
product of $A$ with itself. Let $x = a_1 a_2 \cdots a_k \in A^k$ and $y = b_1 b_2 \cdots b_l \in A^l$. Then a string
$z = c_1 c_2 \cdots c_{k+l} \in A^{k+l}$ can be written as the concatenation of strings $x$ and $y$ of length $k$
and $l$, denoted $z = x \cdot y$, where

$$x \cdot y = a_1 a_2 \cdots a_k b_1 b_2 \cdots b_l$$

That is, $c_i = a_i$ for $1 \le i \le k$ and $c_i = b_{i-k}$ for $k + 1 \le i \le k + l$.

The empty string, denoted $\epsilon$, is a special string with the property that when concatenated
with any other string $x$ it returns $x$; that is, $x \cdot \epsilon = \epsilon \cdot x = x$. The empty string is said to have
zero length. As a special case of $A^k$, we let $A^0$ denote the set containing the empty string;
that is, $A^0 = \{\epsilon\}$.

The concatenation of sets of strings $A$ and $B$, denoted $A \cdot B$, is the set of strings formed
by concatenating each string in $A$ with each string in $B$. For example, $\{00, 1\} \cdot \{a, bb\} =
\{00a, 00bb, 1a, 1bb\}$. The concatenation of a set $A$ with the empty set $\emptyset$, denoted $A \cdot \emptyset$, is the
empty set because $\emptyset$ supplies no string to concatenate with the strings in $A$; that is,

$$A \cdot \emptyset = \emptyset \cdot A = \emptyset$$

When no confusion arises, we write $AB$ instead of $A \cdot B$.
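
As a small illustration, concatenation of sets of strings maps directly onto a set comprehension. The sketch below uses Python strings for strings over an alphabet; the name `concat` is an assumption of this example.

```python
def concat(A: set, B: set) -> set:
    """Return A . B: every string of A followed by every string of B."""
    return {x + y for x in A for y in B}


example = concat({"00", "1"}, {"a", "bb"})  # the example from the text
with_empty_set = concat({"00", "1"}, set())  # no string in B to append
with_empty_string = concat({"00", "1"}, {""})  # concatenation with A^0
```

Concatenating with the empty set gives the empty set, whereas concatenating with the set containing only the empty string returns the original set unchanged.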

A language $L$ over an alphabet $A$ is a collection of strings of potentially different lengths
over $A$. For example, $\{00, 010, 1110, 1001\}$ is a finite language over the alphabet $\{0, 1\}$. (It
is finite because it contains a bounded number of strings.) The set of all strings of all lengths
over the alphabet $A$, including the empty string, is denoted $A^*$ and called the Kleene closure
of $A$. For example, $\{0\}^*$ contains $\epsilon$, the empty string, as well as 0, 00, 000, 0000, .... Also,
$\{00 \cup 1\}^* = \{\epsilon, 1, 00, 001, 100, 0000, \ldots\}$. It follows that a language $L$ over the alphabet $A$
is a subset of $A^*$, denoted $L \subseteq A^*$.

The positive closure of a set $A$, denoted $A^+$, is the set of all strings over $A$ except for
the empty string. For example, $0(0^*10^*)^+$ is the set of binary strings beginning with 0 and
containing at least one 1.
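
Because $A^*$ is infinite, a program can list only a bounded part of it. The following sketch enumerates the strings of the Kleene closure (or the positive closure) up to a chosen length; the function name `closure_up_to` and the length cutoff are assumptions made for illustration.

```python
def closure_up_to(letters, max_len, positive=False):
    """Strings of length <= max_len built by concatenating members of `letters`.

    With positive=False the empty string (zero letters) is included,
    as in the Kleene closure; with positive=True it is left out unless
    the empty string is itself one of the letters.
    """
    reached = {""}
    frontier = {""}
    while frontier:
        # Extend every newly found string by one more letter.
        frontier = {
            s + a
            for s in frontier
            for a in letters
            if len(s + a) <= max_len
        } - reached
        reached |= frontier
    if positive and "" not in letters:
        reached.discard("")
    return reached


short_strings = closure_up_to({"00", "1"}, 4)
```

Here `short_strings` contains the strings listed above for $\{00 \cup 1\}^*$, and it does not contain the string 0, which cannot be formed from the "letters" 00 and 1.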

1.2.4  Relations 

A subset $R$ of the Cartesian product of sets is called a relation. A binary relation $R$ is a
subset of the Cartesian product of two sets. Three examples of binary relations are $R_0 =
\{(0, 0), (1, 1), (2, 4), (3, 9), (4, 16)\}$, $R_1 = \{(\text{red}, 0), (\text{green}, 1), (\text{blue}, 2)\}$, and $R_2 =
\{(\text{small}, \text{short}), (\text{medium}, \text{middle}), (\text{medium}, \text{average}), (\text{large}, \text{tall})\}$. The relation $R_0$ is a
function because for each first component of a pair there is a unique second component. $R_1$
is also a function, but $R_2$ is not a function because the first component medium is paired with
two different second components.

A binary relation $R$ over a set $A$ is a subset of $A \times A$; that is, both components of each
pair are drawn from the same set. We use two notations to denote membership of a pair $(a, b)$
in a binary relation $R$ over $A$, namely $(a, b) \in R$ and the new notation $aRb$. Often it is more
convenient to say $aRb$ than to say $(a, b) \in R$.

A binary relation $R$ is reflexive if for all $a \in A$, $aRa$. It is symmetric if for all $a, b \in A$,
$aRb$ if and only if $bRa$. It is transitive if for all $a, b, c \in A$, if $aRb$ and $bRc$, then $aRc$.

A binary relation $R$ is an equivalence relation if it is reflexive, symmetric, and transitive.
For example, the set of pairs $(a, b)$, $a, b \in \mathbb{N}$, for which $a$ and $b$ have the same remainder on
division by 3 is an equivalence relation. (See Problem 1.3.)

If $R$ is an equivalence relation and $aRb$, then $a$ and $b$ are said to be equivalent elements.
We let $E[a]$ be the set of elements in $A$ that are equivalent to $a$ under the relation $R$ and
call it the equivalence class of elements equivalent to $a$. It is not difficult to show that for all
$a, b \in A$, $E[a]$ and $E[b]$ are either equal or disjoint. (See Problem 1.4.) Thus, the equivalence
classes of an equivalence relation over a set $A$ partition the elements of $A$ into disjoint sets.
For example, the partition $\{0^*, 0(0^*10^*)^+, 1(0 + 1)^*\}$ of the set $(0 + 1)^*$ of binary strings
defines an equivalence relation $R$. The equivalence classes consist of strings consisting of zero or
more 0's, strings starting with 0 and containing at least one 1, and strings beginning with 1. It
follows that $00\,R\,000$ and $1001\,R\,11$ hold but not $10\,R\,01$.
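
The three defining properties can be checked by brute force on a finite set. The sketch below does so for the "same remainder on division by 3" relation restricted to the first twelve natural numbers; the restriction to a finite set and all function names are assumptions of this example, not part of the text.

```python
from itertools import product


def is_equivalence(A, related):
    """Check reflexivity, symmetry, and transitivity of `related` on finite A."""
    reflexive = all(related(a, a) for a in A)
    symmetric = all(related(a, b) == related(b, a) for a, b in product(A, repeat=2))
    transitive = all(
        related(a, c)
        for a, b, c in product(A, repeat=3)
        if related(a, b) and related(b, c)
    )
    return reflexive and symmetric and transitive


def equivalence_classes(A, related):
    """Group the elements of A into classes, assuming `related` is an equivalence."""
    classes = []
    for a in A:
        for cls in classes:
            if related(a, cls[0]):  # one representative suffices
                cls.append(a)
                break
        else:
            classes.append([a])
    return classes


def same_remainder(a, b):
    return a % 3 == b % 3


classes = equivalence_classes(range(12), same_remainder)
```

The resulting classes are pairwise disjoint and together cover the whole set, as the partition property requires.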

1.2.5  Graphs 

A directed graph $G = (V, E)$ consists of a finite set $V$ of distinct vertices and a finite set
of pairs of distinct vertices $E \subseteq V \times V$ called edges. Edge $e$ is incident on vertex $v$ if $e$
contains $v$. A directed graph is undirected if for each edge $(v_1, v_2)$ in $E$ the edge $(v_2, v_1)$
is also in $E$. Figure 1.2 shows two examples of directed graphs, some of whose vertices are
labeled with symbols denoting gates, a topic discussed in Section 1.2.7. In a directed graph
the edge $(v_1, v_2)$ is directed from the vertex $v_1$ to the vertex $v_2$, shown with an arrow from $v_1$
to $v_2$. The in-degree of a vertex in a directed graph is the number of edges directed into it; its
out-degree is the number of edges directed away from it; its degree is the sum of its in- and
out-degree. In a directed graph an input vertex has in-degree zero, whereas an output vertex
either has out-degree zero or is simply any vertex specially designated as an output vertex. A
walk in a graph (directed or undirected) is a tuple of vertices $(v_1, v_2, \ldots, v_p)$ with the property
that $(v_i, v_{i+1})$ is in $E$ for $1 \le i \le p - 1$. A walk $(v_1, v_2, \ldots, v_p)$ is closed if $v_1 = v_p$. A path
is a walk with distinct vertices. A cycle is a closed walk with $p - 1$ distinct vertices, $p \ge 3$.
The length of a path is the number of edges on the path. Thus, the path $(v_1, v_2, \ldots, v_p)$ has
length $p - 1$. A directed acyclic graph (DAG) is a directed graph that has no cycles.

Logic circuits are DAGs in which all vertices except input vertices carry the labels of gates.
Input vertices carry the labels of Boolean variables, variables assuming values over the set
$\mathcal{B} = \{0, 1\}$. The graph of Fig. 1.2(a) is the logic circuit of Fig. 1.3(c), whereas the graph
of Fig. 1.2(b) is the logic circuit of Fig. 1.4. (The figures are shown in Section 1.4.1, Logic
Circuits.) The set of labels of logic gates used in a DAG is called the basis $\Omega$ for the DAG. The
size of a circuit is the number of non-input vertices that it contains. Its depth is the length of
the longest directed path from an input vertex to an output vertex.


1.2.6  Matrices 

An $m \times n$ matrix is an array of elements containing $m$ rows and $n$ columns. (See Chapter 6.)
The adjacency matrix of a graph $G$ with $n$ vertices is an $n \times n$ matrix whose entries are 0 or
1. The entry in the $i$th row and $j$th column is 1 if there is an edge from vertex $i$ to vertex $j$
and 0 otherwise. The adjacency matrix $A$ for the graph in Fig. 1.2(a) is

$$A = \begin{bmatrix}
0 & 0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 0 & 0
\end{bmatrix}$$

1.2.7  Functions 

The  engineering  component  of  computer  science  is  concerned  with  the  design,  development, 
and  testing  of  hardware  and  software.  The  theoretical  component  is  concerned  with  questions 
of  feasibility  and  optimality.  For  example,  one  might  ask  if  there  exists  a  program  H  that  can 
determine  whether  an  arbitrary  program  P  on  an  arbitrary  input  I  will  halt  or  not.  This  is 
an  example  of  an  unsolvable  computational  problem.  While  it  is  a  fascinating  topic,  practice 
often  demands  answers  to  less  ethereal  questions,  such  as  “Can  a  particular  problem  be  solved 
on  a  general-purpose  computer  with  storage  space  S  in  T  steps?” 

To address feasibility and optimality it is important to have a precise definition of the tasks
under examination. Functions serve this purpose. A function (or mapping) $f : \mathcal{D} \mapsto \mathcal{R}$ is
a relation $f$ on the Cartesian product $\mathcal{D} \times \mathcal{R}$ subject to the requirement that for each $d \in \mathcal{D}$
there is at most one pair $(d, r)$ in $f$. If $(d, r) \in f$, we say that the value of $f$ on $d$ is $r$, denoted
$f(d) = r$. The domain and codomain of $f$ are $\mathcal{D}$ and $\mathcal{R}$, respectively. The sets $\mathcal{D}$ and $\mathcal{R}$ can
be finite or infinite. For example, let $f_{\text{mult}} : \mathbb{N}^2 \mapsto \mathbb{N}$ of domain $\mathcal{D} = \mathbb{N}^2$ and codomain
$\mathcal{R} = \mathbb{N}$ map a pair of natural numbers $x$ and $y$ ($\mathbb{N} = \{0, 1, 2, 3, \ldots\}$) into their product $z$;
that is, $f_{\text{mult}}(x, y) = z = x * y$. A function $f : \mathcal{D} \mapsto \mathcal{R}$ is partial if for some $d \in \mathcal{D}$ no value
in $\mathcal{R}$ is assigned to $f(d)$. Otherwise, a function is complete.

If the domain of a function is the Cartesian product of $n$ sets, the function is said to have
$n$ input variables. If the codomain of a function is the Cartesian product of $m$ sets, the
function is said to have $m$ output variables. If the input variables of such a function are all
drawn from the set $A$ and the output variables are all drawn from the set $B$, this information
is often captured by the notation $f^{(n,m)} : A^n \mapsto B^m$. However, we frequently do not use
exponents or we use only one exponent to parametrize a class of problems.

A finite function is one whose domain and codomain are both finite sets. Finite functions
can be completely defined by tables of pairs $\{(d, r)\}$, where $d$ is an element of its domain and
$r$ is the corresponding element of its codomain.

Binary functions are complete finite functions whose domains and codomains are Cartesian
products over the binary set $\mathcal{B} = \{0, 1\}$. Boolean functions are binary functions whose
codomain is $\mathcal{B}$. The tables below define three Boolean functions on two input variables and
one Boolean function on one input variable. They are called truth tables because the values 1
and 0 are often associated with the values True and False, respectively.


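$$\begin{array}{cc|c}
x & y & x \wedge y \\ \hline
0 & 0 & 0 \\
0 & 1 & 0 \\
1 & 0 & 0 \\
1 & 1 & 1
\end{array}
\qquad
\begin{array}{cc|c}
x & y & x \vee y \\ \hline
0 & 0 & 0 \\
0 & 1 & 1 \\
1 & 0 & 1 \\
1 & 1 & 1
\end{array}
\qquad
\begin{array}{cc|c}
x & y & x \oplus y \\ \hline
0 & 0 & 0 \\
0 & 1 & 1 \\
1 & 0 & 1 \\
1 & 1 & 0
\end{array}
\qquad
\begin{array}{c|c}
x & \overline{x} \\ \hline
0 & 1 \\
1 & 0
\end{array}$$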

The above tables define the AND function $x \wedge y$ (its value is True when $x$ and $y$ are True),
the OR function $x \vee y$ (its value is True when either $x$ or $y$ or both are True), the EXCLUSIVE
OR function $x \oplus y$ (its value is True only when exactly one of $x$ and $y$ is True, that is, when $x$ is
True and $y$ is False or vice versa), and the NOT function $\overline{x}$ (its value is True when $x$ is
False and vice versa). The notation $f_{\wedge}^{(2,1)} : \mathcal{B}^2 \mapsto \mathcal{B}$, $f_{\vee}^{(2,1)} : \mathcal{B}^2 \mapsto \mathcal{B}$, $f_{\oplus}^{(2,1)} : \mathcal{B}^2 \mapsto \mathcal{B}$,
$f_{\neg}^{(1,1)} : \mathcal{B} \mapsto \mathcal{B}$ for these functions makes explicit their number of input and output variables.
We generally suppress the second superscript when functions are Boolean. The physical devices
that compute the AND, OR, NOT, and EXCLUSIVE OR functions are called gates.
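
Since a finite function is fully described by its table, the four gate functions can be written as small Python functions and their truth tables generated mechanically. This is a sketch; the names `AND`, `OR`, `XOR`, `NOT`, and `truth_table` are chosen here for illustration.

```python
from itertools import product


def AND(x, y):
    return x & y


def OR(x, y):
    return x | y


def XOR(x, y):
    return x ^ y


def NOT(x):
    return 1 - x


def truth_table(f, n):
    """Map every n-tuple over {0, 1} to the value of f on that tuple."""
    return {bits: f(*bits) for bits in product((0, 1), repeat=n)}


xor_table = truth_table(XOR, 2)
```

Each dictionary returned by `truth_table` is exactly a table of pairs $(d, r)$ in the sense of a finite function.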

Many computational problems are described by functions $f : A^* \mapsto C^*$ from the (unbounded)
set of strings over an alphabet $A$ to the set of strings over a potentially different
alphabet $C$. Since the letters in every finite alphabet $A$ can be encoded as fixed-length strings
over the binary alphabet $\mathcal{B} = \{0, 1\}$, there is no loss of generality in assuming that functions
are mappings $f : \mathcal{B}^* \mapsto \mathcal{B}^*$, that is, from strings over $\mathcal{B}$ to strings over $\mathcal{B}$.

Functions with unbounded domains can be used to identify languages. A language $L$ over
the alphabet $A$ is uniquely determined by a characteristic function $f : A^* \mapsto \mathcal{B}$ with the
property that $L = \{x \mid x \in A^* \text{ such that } f(x) = 1\}$. This statement means that $L$ is the set
of strings $x$ in $A^*$ for which the value of $f$ on them, namely $f(x)$, is 1.

We often restrict a function $f : \mathcal{B}^* \mapsto \mathcal{B}^*$ to input strings of length $n$, $n$ arbitrary. The
domain of such a function is $\mathcal{B}^n$. Its codomain consists of those strings into which strings of
length $n$ map. This set may contain strings of many lengths. It is often convenient to map
strings of length $n$ to strings of a fixed length containing the same information. This can be
done as follows. Let $h(n)$ be the length of a longest string that is the value of an input string
of length $n$. Encode letters in $\mathcal{B}$ by repeating them (replace 0 by 00 and 1 by 11) and then add
as a prefix as many instances of 01 as necessary to ensure that each string in the codomain of
$f_n$ has $2h(n)$ characters. For example, if $h(4) = 3$ and $f(0110) = 10$, encode the value 10 as
011100. This encoding provides a function $f_n : \mathcal{B}^n \mapsto \mathcal{B}^{2h(n)}$ containing all the information
that is in the original version of $f_n$.
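
The padding scheme just described is easy to state as code. The sketch below encodes a binary string into exactly $2h$ characters and decodes it again; the function names and the use of Python strings are assumptions of this example.

```python
def encode_fixed_length(value: str, h: int) -> str:
    """Double each letter, then prefix with copies of '01' up to length 2h."""
    if len(value) > h:
        raise ValueError("value is longer than h")
    doubled = "".join(c * 2 for c in value)  # 0 -> 00, 1 -> 11
    return "01" * (h - len(value)) + doubled


def decode_fixed_length(code: str) -> str:
    """Invert the encoding: drop '01' padding pairs, halve the other pairs."""
    pairs = [code[i:i + 2] for i in range(0, len(code), 2)]
    return "".join(p[0] for p in pairs if p != "01")


encoded = encode_fixed_length("10", 3)
```

Decoding works because the padding pair 01 can never be confused with the letter pairs 00 and 11, which is why no information is lost.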

It is often useful to work with functions $f : \mathbb{R} \mapsto \mathbb{R}$ whose domains and codomains are
real numbers $\mathbb{R}$. Functions of this type include linear functions, polynomials, exponentials,
and logarithms. A polynomial $p(x) : \mathbb{R} \mapsto \mathbb{R}$ of degree $k - 1$ in the variable $x$ is specified
by a set of $k$ real coefficients, $c_{k-1}, \ldots, c_1, c_0$, where $p(x) = c_{k-1}x^{k-1} + \cdots + c_1 x^1 + c_0$.
A linear function is a polynomial of degree 1. An exponential function is a function of the
form $E(x) = a^x$ for some real $a$; for example, $2^{1.5} = 2.8284271\ldots$. The logarithm to the
base $a$ of $b$, denoted $\log_a b$, is the value of $x$ such that $a^x = b$. For example, the logarithm to
base 2 of $2.8284271\ldots$ is 1.5 and the logarithm to base 10 of 100 is 2. A function $f(x)$ is
polylogarithmic if for some polynomial $p(x)$ we can write $f(x)$ as $p(\log_2 x)$; that is, it is a
polynomial in the logarithm of $x$.

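Two other functions used often in this book are the floor and ceiling functions. Their
domains are the reals, but their codomains are the integers. The ceiling function, denoted
$\lceil x \rceil : \mathbb{R} \mapsto \mathbb{Z}$, maps the real $x$ to the smallest integer greater than or equal to it. The floor
function, denoted $\lfloor x \rfloor : \mathbb{R} \mapsto \mathbb{Z}$, maps the real $x$ to the largest integer less than or equal to
it. Thus, $\lceil 3.5 \rceil = 4$ and $\lceil 15.0001 \rceil = 16$. Similarly, $\lfloor 3.5 \rfloor = 3$ and $\lfloor 15.0001 \rfloor = 15$. The
following bounds apply to the floor and ceiling functions.

$$f(x) - 1 < \lfloor f(x) \rfloor \le f(x)$$

$$f(x) \le \lceil f(x) \rceil < f(x) + 1$$

As an example of the application of the ceiling function we note that $\lceil \log_2 n \rceil$ bits suffice
to give each of $n$ distinct values, such as the integers 0 through $n - 1$, its own binary
representation. Because $k$ bits represent integers up to $2^k - 1$, representing the integer $n$
itself takes $\lceil \log_2 (n + 1) \rceil$ bits.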
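
The floor and ceiling values quoted above can be checked with Python's `math` module. The helper `bits_needed` below is a hypothetical name; it counts bits directly from the earlier bound that $k$ bits represent integers up to $2^k - 1$, and the result is then compared with $\lceil \log_2 (n + 1) \rceil$ for small $n$.

```python
import math


def bits_needed(n: int) -> int:
    """Smallest k >= 1 such that n <= 2^k - 1."""
    if n < 0:
        raise ValueError("n must be a natural number")
    k = 1
    while 2 ** k - 1 < n:
        k += 1
    return k


ceil_examples = (math.ceil(3.5), math.ceil(15.0001))
floor_examples = (math.floor(3.5), math.floor(15.0001))
bits_for_97 = bits_needed(97)
```

Note that an exact power of two such as 4 needs one more bit than $\log_2 4 = 2$ would suggest, since its binary representation is (1, 0, 0).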

1.2.8  Rate  of  Growth  of  Functions 

Throughout this book we derive mathematical expressions for quantities such as space, time,
and circuit size. Generally these expressions describe functions $f : \mathbb{N} \mapsto \mathbb{R}$ from the
non-negative integers to the reals, such as the functions $f_1(n)$ and $f_2(n)$ defined as

$$f_1(n) = 4.5n^2 + 3n$$
$$f_2(n) = 3^n + 4.5n^2$$

When $n$ is large we often wish to simplify expressions such as these to make explicit their
dominant or most rapidly growing term. For example, for large values of $n$ the dominant terms
in $f_1(n)$ and $f_2(n)$ are $4.5n^2$ and $3^n$ respectively, as we show. A term dominates when $n$ is
large if the value of the function is approximately the value of this term, that is, if the function
is within some multiplicative factor of the term.

To highlight dominant terms we introduce the big Oh, big Omega, and big Theta notation.
They are defined for functions whose domains and codomains are the integers or the
reals.

DEFINITION 1.2.1 Let $f : \mathbb{R} \mapsto \mathbb{R}$ and $g : \mathbb{R} \mapsto \mathbb{R}$ be two functions whose domains and
codomains are either the integers or the reals. If there are positive constants $x_0$ and $K > 0$ such
that for all $|x| \ge x_0$,

$$|f(x)| \le K|g(x)|$$

we write

$$f(x) = O(g(x))$$

and say that "$f(x)$ is big Oh of $g(x)$" or it grows no more rapidly in $x$ than $g(x)$. Under the
same conditions we also write

$$g(x) = \Omega(f(x))$$

and say that "$g(x)$ is big Omega of $f(x)$" or that it grows at least as rapidly in $x$ as $f(x)$.

If $f(x) = O(g(x))$ and $g(x) = O(f(x))$, we write

$$f(x) = \Theta(g(x)) \text{ or } g(x) = \Theta(f(x))$$

and say that "$f(x)$ is big Theta of $g(x)$" and "$g(x)$ is big Theta of $f(x)$" or that the two
functions have the same rate of growth in $x$.

The big Oh notation is illustrated by the expressions for $f_1(n)$ and $f_2(n)$ above.

EXAMPLE 1.2.1 We show that $f_1(n) = 4.5n^2 + 3n$ is $O(n^k)$ for any $k \ge 2$; that is, $f_1(n)$
grows no more rapidly than $n^k$ for $k \ge 2$. We also show that $n^k = O(f_1(n))$ for $k \le 2$; that
is, that $n^k$ grows no more rapidly than $f_1(n)$ for $k \le 2$. From the above definitions it follows
that $f_1(n) = \Theta(n^2)$; that is, $f_1(n)$ and $n^2$ have the same rate of growth. We say that $f_1(n)$ is a
quadratic function in $n$.

To prove the first statement, we need to exhibit a natural number $n_0$ and a constant $K_0 > 0$
such that for all $n \ge n_0$, $f_1(n) \le K_0 n^k$. If we can show that $f_1(n) \le K_0 n^2$, then we have
shown $f_1(n) \le K_0 n^k$ for all $k \ge 2$. To show the former, we must show the following for some
$K_0 > 0$ and for all $n \ge n_0$:

$$4.5n^2 + 3n \le K_0 n^2$$

We try $K_0 = 5.5$ and find that the above inequality is equivalent to $3n \le n^2$ or $3 \le n$. Thus, we
can choose $n_0 = 3$ and we are done.

To prove the second statement, namely, that $n^k = O(f_1(n))$ for $k \le 2$, we must exhibit a
natural number $n_1$ and some $K_1 > 0$ such that for all $n \ge n_1$, $n^k \le K_1 f_1(n)$. If we can show
that $n^2 \le K_1 f_1(n)$, then we have shown $n^k \le K_1 f_1(n)$. To show the former, we must show the
following for some $K_1 > 0$ and for all $n \ge n_1$:

$$n^2 \le K_1(4.5n^2 + 3n)$$

Clearly, if $K_1 = 1/4.5$ the inequality holds for $n \ge 0$, since $3K_1 n$ is nonnegative. Thus, we choose
$n_1 = 0$ and we are done.
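
A finite numerical check cannot replace the argument above, but it is a useful sanity test of the chosen constants. The sketch below uses exact rational arithmetic to test both inequalities of Example 1.2.1 over a range of $n$; the upper limit of the range is an arbitrary choice made for this illustration.

```python
from fractions import Fraction


def f1(n):
    """f_1(n) = 4.5 n^2 + 3 n, computed exactly."""
    return Fraction(9, 2) * n ** 2 + 3 * n


K0, n0 = Fraction(11, 2), 3   # constants for f_1(n) <= K0 * n^2
K1, n1 = Fraction(2, 9), 0    # K1 = 1/4.5, for n^2 <= K1 * f_1(n)

LIMIT = 2000  # arbitrary cutoff for the check

upper_ok = all(f1(n) <= K0 * n ** 2 for n in range(n0, LIMIT))
lower_ok = all(n ** 2 <= K1 * f1(n) for n in range(n1, LIMIT))
fails_below_n0 = f1(2) > K0 * 2 ** 2
```

The last line shows why $n_0 = 3$ is needed for $K_0 = 5.5$: the upper bound fails at $n = 2$.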

EXAMPLE 1.2.2 We now show that the slightly more complex function $f_2(n) = 3^n + 4.5n^2$
grows as $3^n$; that is, $f_2(n) = \Theta(3^n)$, an exponential function in $n$. Because $3^n \le f_2(n)$ for
all $n \ge 0$, it follows that $3^n = O(f_2(n))$. To show that $f_2(n) = O(3^n)$, we demonstrate that
$f_2(n) \le 2(3^n)$ holds for $n \ge 4$. This is equivalent to the following inequality:

$$4.5n^2 \le 3^n$$

To prove this holds, we show that $h(n) = 3^n/n^2$ is an increasing function of $n$ for $n \ge 2$
and that $h(4) \ge 4.5$. To show that $h(n)$ is an increasing function of $n$, we compute the ratio
$r(n) = h(n + 1)/h(n)$ and show that $r(n) \ge 1$ for $n \ge 2$. But $r(n) = 3n^2/(n + 1)^2$ and
$r(n) \ge 1$ when $3n^2 \ge (n + 1)^2$ or when $n(n - 1) \ge 1/2$, which holds for $n \ge 2$. Since
$h(3) = 3$ and $h(4) = 81/16 > 5$, the desired conclusion follows.
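
The quantities used in Example 1.2.2 can also be evaluated exactly. The following sketch checks the values of $h(3)$ and $h(4)$, the monotonicity of $h$, and the bound $f_2(n) \le 2(3^n)$ over a finite range; the range limit is an arbitrary assumption of this illustration.

```python
from fractions import Fraction


def f2(n):
    """f_2(n) = 3^n + 4.5 n^2, computed exactly."""
    return 3 ** n + Fraction(9, 2) * n ** 2


def h(n):
    """h(n) = 3^n / n^2 for n >= 1."""
    return Fraction(3 ** n, n ** 2)


LIMIT = 200  # arbitrary cutoff for the check

h_increasing = all(h(n + 1) >= h(n) for n in range(2, LIMIT))
bound_holds = all(f2(n) <= 2 * 3 ** n for n in range(4, LIMIT))
bound_fails_at_3 = f2(3) > 2 * 3 ** 3
```

The bound fails at $n = 3$, where $h(3) = 3 < 4.5$, which is why the example starts at $n \ge 4$.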

1.3  Methods  of  Proof 

In this section we briefly introduce several methods of proof that are used in this book, namely,
proof by induction, proof by contradiction, and the pigeonhole principle. In the previous
section we saw proof by reduction: in each step the condition to be established was translated
into another condition until a condition was found that was shown to be true.

Proofs by induction use predicates, that is, functions of the kind $P : \mathbb{N} \mapsto \mathcal{B}$. The
truth value of the predicate $P : \mathbb{N} \mapsto \mathcal{B}$ on the natural number $n$, denoted $P(n)$, is 1 or 0
depending on whether the predicate is True or False.

Proofs by induction are used to prove statements of the kind, "For all natural numbers
$n$, predicate (or property) $P$ is true." Consider the function $S_1 : \mathbb{N} \mapsto \mathbb{N}$ defined by the
following sum:

$$S_1(n) = \sum_{j=1}^{n} j \qquad (1.1)$$

We use induction to prove that $S_1(n) = n(n + 1)/2$ is true for each $n \in \mathbb{N}$.

DEFINITION 1.3.1 A proof by induction has a predicate $P$, a basis step, an induction
hypothesis, and an inductive step. The basis establishes that $P(k)$ is true for some integer $k$. The
induction hypothesis assumes that for some fixed but arbitrary natural number $n \ge k$, the
statements $P(k), P(k + 1), \ldots, P(n)$ are true. The inductive step is a proof that $P(n + 1)$ is true
given the induction hypothesis.

It  follows  from  this  definition  that  a  proof  by  induction  with  the  predicate  P  establishes 
that  P  is  true  for  all  natural  numbers  larger  than  or  equal  to  k  because  the  inductive  step 
establishes  the  truth  of  P(n  +  1)  for  arbitrary  integer  n  greater  than  or  equal  to  k.  Also, 
induction  may  be  used  to  show  that  a  predicate  holds  for  a  subset  of  the  natural  numbers.  For 
example,  the  hypothesis  that  every  even  natural  number  is  divisible  by  2  is  one  that  would  be 
defined  only  on  the  even  numbers. 

The following proof by induction shows that $S_1(n) = n(n + 1)/2$ for $n \ge 0$.

LEMMA 1.3.1 For all $n \ge 0$, $S_1(n) = n(n + 1)/2$.

Proof PREDICATE: The value of the predicate $P$ on $n$, $P(n)$, is True if $S_1(n) = n(n + 1)/2$
and False otherwise.

BASIS STEP: Clearly, $S_1(0) = 0$ from both the sum and the closed form given above.

INDUCTION HYPOTHESIS: $S_1(k) = k(k + 1)/2$ for $k = 0, 1, 2, \ldots, n$.

INDUCTIVE STEP: By the definition of the sum for $S_1$ given in (1.1), $S_1(n + 1) = S_1(n) + n + 1$.
Thus, it follows that $S_1(n + 1) = n(n + 1)/2 + n + 1$. Factoring out $n + 1$ and
rewriting the expression, we have that $S_1(n + 1) = (n + 1)((n + 1) + 1)/2$, exactly the
desired form. Thus, the statement of the lemma follows for all values of $n$. ■
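
The lemma can be spot-checked numerically, which also makes the inductive step concrete. The sketch below compares the sum with the closed form for small $n$ and verifies the algebraic identity used in the inductive step; the function names are illustrative and the check is no substitute for the proof.

```python
def S1(n: int) -> int:
    """The sum 1 + 2 + ... + n, with S1(0) = 0."""
    return sum(range(1, n + 1))


def closed_form(n: int) -> int:
    """The claimed value n(n + 1)/2."""
    return n * (n + 1) // 2


basis_holds = S1(0) == closed_form(0) == 0
# Inductive step identity: n(n+1)/2 + (n+1) = (n+1)(n+2)/2.
step_holds = all(closed_form(n) + n + 1 == closed_form(n + 1) for n in range(500))
lemma_holds = all(S1(n) == closed_form(n) for n in range(500))
```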

We  now  define  proof  by  contradiction. 

DEFINITION 1.3.2 A proof by contradiction has a predicate $P$. The complement $\neg P$ of $P$ is
shown to be False, which implies that $P$ is True.

The examples shown earlier of strings in the language $L = \{00 \cup 1\}^*$ suggest that $L$
contains only strings other than $\epsilon$ with an odd number of 1's. Let $P$ be the predicate "$L$
contains strings other than $\epsilon$ with an even number of 1's." We show that it is true by assuming
it is false, namely, by assuming "$L$ contains only strings with an odd number of 1's" and
showing that this statement is false. In particular, we show that $L$ contains the string 11. From
the definition of the Kleene closure, $L$ contains strings of all lengths in the "letters" 00 and 1.
Thus, it contains a string containing two instances of 1 and the predicate $P$ is true.

Induction  and  proof  by  contradiction  can  also  be  used  to  establish  the  pigeonhole  principle. 
The  pigeonhole  principle  states  that  if  there  are  n  pigeonholes,  n  +  1  or  more  pigeons,  and 
every  pigeon  occupies  a  hole,  then  some  hole  must  have  at  least  two  pigeons.  We  reformulate 
the  principle  as  follows: 

LEMMA 1.3.2 Given two finite sets $A$ and $B$ with $|A| > |B|$, there does not exist a naming
function $\nu : A \mapsto B$ that gives to each element $a$ in $A$ a name $\nu(a)$ in $B$ such that every element
in $A$ has a unique name.

Proof BASIS: $|B| = 1$. To show that the statement is True, assume it is False and show
that a contradiction occurs. If it is False, every element in $A$ can be given a unique name.
However, since there is one name (the one element of $B$) and more than one element in $A$,
we have a contradiction.

INDUCTION HYPOTHESIS: There is no naming function $\nu : A \mapsto B$ when $|B| \le n$ and
$|A| > |B|$.

INDUCTIVE STEP: When $|B| = n + 1$ and $|A| > |B|$ we show there is no naming function
$\nu : A \mapsto B$. Consider an element $b \in B$. If two elements of $A$ have the name $b$, the desired
conclusion holds. If not, remove $b$ from $B$, giving the set $B'$, and remove from $A$ the
element, if any, whose name is $b$, giving the set $A'$. Since $|A'| > |B'|$ and $|B'| \le n$, by the
induction hypothesis, there is no naming function obtained by restricting $\nu$ to $A'$. Thus,
the desired conclusion holds. ■
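
For very small sets the lemma can be confirmed by exhaustive search: list every function from $A$ to $B$ and keep those that give distinct names to distinct elements. The sketch below does this with `itertools.product`; the function name `injective_namings` is an assumption of this example, and the search is feasible only for tiny sets.

```python
from itertools import product


def injective_namings(A, B):
    """All functions A -> B (as dicts) in which no two elements share a name."""
    A, B = list(A), list(B)
    result = []
    for names in product(B, repeat=len(A)):  # every way to name the elements of A
        if len(set(names)) == len(A):        # all names distinct
            result.append(dict(zip(A, names)))
    return result


three_pigeons_two_holes = injective_namings("abc", [0, 1])
two_pigeons_three_holes = injective_namings("ab", [0, 1, 2])
```

With more elements than names the list is empty; with fewer elements than names, unique namings do exist.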

1.4  Computational  Models 

A  variety  of  computer  models  are  examined  in  this  book.  In  this  section  we  give  the  reader 
a  taste  of  five  models,  the  logic  circuit,  the  finite-state  machine,  the  random-access  machine, 
the  pushdown  automaton,  and  the  Turing  machine.  We  also  briefly  survey  the  problem  of 
language  recognition. 

1.4.1  Logic  Circuits 

A  logic  gate  is  a  physical  device  that  realizes  a  Boolean  function.  A  logic  circuit,  as  defined 
in  Section  1.2,  is  a  directed  acyclic  graph  in  which  all  vertices  except  input  vertices  carry  the 
labels  of  gates. 

Logic gates can be constructed in many different technologies. To make ideas concrete,
Fig. 1.3(a) and (b) show electrical circuits for the AND and OR gates constructed with batteries,
bulbs, and switches. Shown with each of these circuits is a logic symbol for the gate. These
symbols are used to draw circuits, such as the circuit of Fig. 1.3(c) for the function $(x \vee y) \wedge z$.
When electrical current flows out of the batteries through a switch or switches in these circuits,
the bulbs are lit. In this case we say the value of the circuit is True; otherwise it is False. Shown
below is the truth table for the function mapping the values of the three input variables of the
circuit in Fig. 1.3(c) to the value of the one output variable. Here $x$, $y$, and $z$ have value 1
when the switch that carries its name is closed; otherwise they have value 0.

Figure 1.3 Three electrical circuits, (a), (b), and (c), simulating logic circuits.

$$\begin{array}{ccc|c}
x & y & z & (x \vee y) \wedge z \\ \hline
0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 1 & 0 & 0 \\
0 & 1 & 1 & 1 \\
1 & 0 & 0 & 0 \\
1 & 0 & 1 & 1 \\
1 & 1 & 0 & 0 \\
1 & 1 & 1 & 1
\end{array}$$

Today's computers use transistor circuits instead of the electrical circuits of Fig. 1.3.

Logic circuits execute straight-line programs, programs containing only assignment
statements. Thus, they have no loops or branches. (They may have loops if the number of times
a loop is executed is fixed.) This point is illustrated by the "full-adder" circuit of Fig. 1.4,
a circuit discussed at length in Section 2.7. Each external input and each gate is assigned a
unique integer. Each is also assigned a variable whose value is the value of the external input
or gate. The $i$th vertex is assigned the variable $x_i$. If $x_i$ is associated with a gate that combines
the results produced at the $j$th and $k$th gates with the operator $\odot$, we write an assignment
operation of the form $x_i := x_j \odot x_k$. The sequence of assignment operations for a circuit is
a straight-line program. Below is a straight-line program for the circuit of Fig. 1.4:


$$\begin{array}{l}
x_4 := x_1 \oplus x_2 \\
x_5 := x_4 \wedge x_3 \\
x_6 := x_1 \wedge x_2 \\
x_7 := x_4 \oplus x_3 \\
x_8 := x_5 \vee x_6
\end{array}$$

The values computed for $(x_8, x_7)$ are the standard binary representation for the number of 1's
among $x_1$, $x_2$, and $x_3$. This can be seen by constructing a table of values for $x_1$, $x_2$, $x_3$, $x_7$,
and $x_8$. Full-adder circuits can be combined to construct an adder for binary numbers. (In
Section 2.2 we give another notation for straight-line programs.)

Figure 1.4 A full-adder circuit. Its output pair $(x_8, x_7)$ is the standard binary representation
for the number of 1's among its three inputs $x_1$, $x_2$, and $x_3$.
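
A straight-line program translates line for line into a sequence of assignments in any programming language. The sketch below writes the full-adder program as a Python function, assuming the gate assignments $x_4 := x_1 \oplus x_2$, $x_5 := x_4 \wedge x_3$, $x_6 := x_1 \wedge x_2$, $x_7 := x_4 \oplus x_3$, $x_8 := x_5 \vee x_6$, and then builds the table of values mentioned above. The function name `full_adder` is chosen for illustration.

```python
from itertools import product


def full_adder(x1, x2, x3):
    """Straight-line program: assignments only, no loops or branches."""
    x4 = x1 ^ x2   # EXCLUSIVE OR
    x5 = x4 & x3   # AND
    x6 = x1 & x2   # AND
    x7 = x4 ^ x3   # EXCLUSIVE OR: low-order output bit
    x8 = x5 | x6   # OR: high-order output bit
    return x8, x7


# Table of values: inputs (x1, x2, x3) mapped to outputs (x8, x7).
table = {bits: full_adder(*bits) for bits in product((0, 1), repeat=3)}
```

In every row of `table`, the pair $(x_8, x_7)$ read as a two-bit binary number equals the number of 1's among the three inputs.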

As shown in the truth table for Fig. 1.3(c), each logic circuit has associated with it a binary
function that maps the values of its input variables to the values of its output variables. In the
case of the full-adder, since $x_8$ and $x_7$ are its output variables, we associate with it the function
$f_{\text{FA}}^{(3,2)} : \mathcal{B}^3 \mapsto \mathcal{B}^2$, whose value is $f_{\text{FA}}^{(3,2)}(x_1, x_2, x_3) = (x_8, x_7)$.

Algebraic circuits are similar to logic circuits except they may use operations over
non-binary sets, such as addition and multiplication over a ring, a concept explained in
Section 6.2.1. Algebraic circuits are the subject of Chapter 6. They are also described by DAGs
and they execute straight-line programs where the operators are non-binary functions.
Algebraic circuits also have associated with them functions that map the values of inputs to the
values of outputs.

Logic  circuits  are  the  basic  building  blocks  of  all  digital  computers  today.  When  such 
circuits  are  combined  with  binary  memory  cells,  machines  with  memory  can  be  constructed. 
The  models  for  these  machines  are  called  finite-state  machines. 


1.4.2  Finite-State  Machines 

Figure 1.5 (a) The finite-state machine (FSM) model; at each unit of time its logic unit, $L$, operates on its current state (taken from its memory) and its current external input to compute an external output and a new state that it stores in its memory. (b) An FSM that holds in its memory a bit that is the EXCLUSIVE OR of the initial value stored in its memory and the external inputs received to the present time.

The finite-state machine (FSM) is a machine with memory. It executes a series of steps during each of which it takes its current state from the set $Q$ of states and current external input from the set $\Sigma$ of input letters and combines them in a logic circuit $L$ to produce a successor state in $Q$ and an output letter in $\Psi$, as suggested in Fig. 1.5. The logic circuit $L$ can be viewed as having two parts, one that computes the next-state function $\delta : Q \times \Sigma \mapsto Q$, whose value is the next state of the FSM, and the other that computes the output function $\lambda : Q \mapsto \Psi$, whose value is the output of the FSM in the current state. A generic finite-state machine is shown in Fig. 1.5(a) along with a concrete FSM in Fig. 1.5(b) that provides as successor state and output the EXCLUSIVE OR of the current state and the external input. The state diagram of the FSM in Fig. 1.5(b) is shown in Fig. 1.8. Two (or more) finite-state machines that operate in lockstep can be interconnected to form a single FSM. In this case, some outputs of one FSM serve as inputs to the other.
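
The step-by-step behavior of an FSM can be sketched as a short simulator. This is a minimal illustration, assuming the output recorded at each step is the output function applied to the newly entered state; the names `run_fsm`, `xor_delta`, and `xor_lambda` are hypothetical and do not come from the text.

```python
from typing import Callable, Iterable, List, Tuple


def run_fsm(
    delta: Callable[[int, int], int],
    lam: Callable[[int], int],
    state: int,
    inputs: Iterable[int],
) -> Tuple[int, List[int]]:
    """Run an FSM from `state` on `inputs`; return (final state, outputs)."""
    outputs = []
    for letter in inputs:
        state = delta(state, letter)   # next-state function
        outputs.append(lam(state))     # output function of the new state
    return state, outputs


def xor_delta(q: int, x: int) -> int:
    """Next state of the machine of Fig. 1.5(b): EXCLUSIVE OR of state and input."""
    return q ^ x


def xor_lambda(q: int) -> int:
    """The output of this machine is simply the stored bit."""
    return q


final_state, outputs = run_fsm(xor_delta, xor_lambda, 0, [1, 0, 1, 1])
```

Started in state 0, the machine's stored bit after each step is the EXCLUSIVE OR of the inputs received so far.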

Finite-state  machines  are  ubiquitous  today.  They  are  found  in  microwave  ovens,  VCRs  and 
automobiles.  They  can  be  simple  or  complex.  One  of  the  most  useful  FSMs  is  the  general- 
purpose  computer  modeled  by  the  random-access  machine. 

1.4.3  Random-Access  Machine 

Figure 1.6 The bounded-memory random-access machine.

The (bounded-memory) random-access machine (RAM) is modeled as a pair of interconnected finite-state machines, one a central processing unit (CPU) and the other a random-access memory, as suggested in Fig. 1.6. The random-access memory holds $m$ $b$-bit words, each identified by an address. It also holds an output word (out_wrd) and a triple of inputs consisting of a command (cmd), an address (addr), and an input data word (in_wrd). cmd is either READ, WRITE, or NO-OP. A NO-OP command does nothing whereas a READ command changes the value of out_wrd to the value of the data word at address addr. A WRITE command replaces the data word at address addr with the value of in_wrd.
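
The three memory commands can be sketched as a small class. This is only an illustration under stated assumptions: words are modeled as Python integers, a written value is masked to the word width, and the class name `RandomAccessMemory` and method `step` are hypothetical.

```python
class RandomAccessMemory:
    """A memory of m words with READ, WRITE, and NO-OP commands."""

    def __init__(self, m: int, b: int) -> None:
        self.b = b                 # word width in bits
        self.words = [0] * m       # data words, indexed by address
        self.out_wrd = 0           # output word, changed only by READ

    def step(self, cmd: str, addr: int = 0, in_wrd: int = 0) -> int:
        """Apply one command and return the current output word."""
        if cmd == "READ":
            self.out_wrd = self.words[addr]
        elif cmd == "WRITE":
            # Assumption: values wider than the word are truncated.
            self.words[addr] = in_wrd & ((1 << self.b) - 1)
        elif cmd != "NO-OP":
            raise ValueError(f"unknown command: {cmd}")
        return self.out_wrd


mem = RandomAccessMemory(m=4, b=8)
mem.step("WRITE", addr=2, in_wrd=0x1FF)   # stored as 0xFF after masking
mem.step("NO-OP")                         # leaves memory and output unchanged
value = mem.step("READ", addr=2)
```

Note that a WRITE leaves the output word untouched, so the CPU sees new data only after issuing a READ.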

The  random-access  memory  holds  data  as  well  as  programs,  collections  of  instructions 
for  the  CPU.  The  CPU  executes  the  fetch-and-execute  cycle  in  which  it  repeatedly  reads  an 
instruction  from  the  random-access  memory  and  executes  it.  Its  instructions  typically  include 
arithmetic,  logic,  comparison,  and  jump  instructions.  Comparisons  are  used  to  decide  whether 
the  CPU  reads  the  next  program  instruction  in  sequence  or  jumps  to  an  instruction  out  of 
sequence. 

The general-purpose computer is much more complex than suggested by the above brief sketch of the RAM. It uses a rich variety of methods to achieve high speed at low cost with the available technology. For example, as the number of components that can fit on a semiconductor chip increases, designers have begun to use “super-scalar” CPUs, CPUs that issue multiple instructions in each time step. Also, memory hierarchies are becoming more prevalent as designers assemble collections of slower but larger memories with lower costs per bit to simulate expensive fast memories.

1.4.4  Other  Models 

There  are  many  other  models  of  computers  with  memory,  some  of  which  have  an  infinite 
supply  of  data  words,  such  as  the  Turing  machine,  a  machine  consisting  of  a  control  unit  (an 
FSM)  and  a  tape  unit  that  has  a  potentially  infinite  linear  array  of  cells  each  containing  letters 
from  an  alphabet  that  can  be  read  and  written  by  a  tape  head  directed  by  the  control  unit.  It 
is  assumed  that  in  each  time  step  the  head  may  move  only  from  one  cell  to  an  adjacent  one  on 
the  linear  array.  (See  Fig.  1.7.)  The  Turing  machine  is  a  standard  model  of  computation  since 
no  other  machine  model  has  been  discovered  that  performs  tasks  it  cannot  perform. 

Figure 1.7 The Turing machine has a control unit that is a finite-state machine and a tape unit that controls reading and writing by a tape head and the movement of the tape head one cell at a time to the left or right of the current position.

The pushdown automaton is a restricted form of Turing machine in which the tape is used as a pushdown stack. Data is entered, deleted, and accessed only at the top of a stack. A pushdown stack can be simulated by a tape in which the cell to the right of the tape head is always blank. If the tape head moves right from a cell, it writes a non-blank symbol in the cell. If it moves left, it writes a blank in that cell before leaving it.

Some  computers  are  serial:  they  execute  one  operation  on  a  fixed  amount  of  data  per  time 
step.  Others  are  parallel;  that  is,  they  have  multiple  (usually  communicating)  subcomputers 
that  operate  simultaneously.  They  may  operate  synchronously  or  asynchronously  and  they  may 
be  connected  via  a  simple  or  a  complex  network.  An  example  of  a  simple  network  is  a  wire 
between  two  computers.  An  example  of  a  complex  network  is  a  crossbar  switch  consisting  of 
25  switches  at  the  intersection  of  five  columns  and  five  rows  of  wires;  closing  the  switch  at  the 
intersection  of  a  row  and  a  column  connects  the  two  wires  and  the  two  computers  to  which 
they  are  attached. 

We close this section by emphasizing the importance of models of computers. Good models provide a level of abstraction at which important facts and insights can be developed without losing so much detail that the results are irrelevant to practice.

1.4.5  Formal  Languages 

In Chapters 4 and 5 the finite-state machine, pushdown automaton, and Turing machine are characterized by their language recognition capability. Formal methods for specifying languages have led to efficient ways to parse and recognize programming languages. This is illustrated by the finite-state machine of Fig. 1.8. Its initial state is $q_0$, its final state is $q_1$ and its inputs can assume values 0 or 1. An output of 0 is produced when the machine is in state $q_0$ and an output of 1 is produced when it is in state $q_1$. The output before the first input is received is 0.

After the first input the output of the FSM of Fig. 1.8 is equal to the input. After multiple inputs the output is the EXCLUSIVE OR of the 1s and 0s among the inputs, as we show by induction. The inductive hypothesis is clearly true after one input. Suppose it is true after $k$ inputs; we show that it remains true after $k + 1$ inputs, and therefore for all inputs. The output uniquely determines the state. There are two cases to consider: after $k$ inputs either the FSM is in state $q_0$ or it is in state $q_1$. For each state, there are two cases to consider based on the value of the $(k+1)$st input. In all four cases it is easy to see that after the $(k+1)$st input the output is the EXCLUSIVE OR of the first $k + 1$ inputs.


Figure 1.8 A state diagram for a finite-state machine whose circuit model is given in Fig. 1.5(b). $q_0$ is the initial state of the machine and $q_1$ is its final state. If the machine is in $q_0$, it has received an even number of 1 inputs, whereas if it is in $q_1$, it has received an odd number of 1s.

The language recognized by an FSM is defined in two ways. It is the set of input strings that cause the FSM to produce a particular letter as its last output or to enter one of the set of final states on its last input. Thus, the FSM of Fig. 1.8 recognizes the set of binary strings containing an odd number of 1s. It also recognizes the set of binary strings containing an even number of 1s because they result in a last output of 0.

An FSM can also compute a function. The most general function that it computes in $T$ steps is a function from $Q \times \Sigma^T$ to $Q \times \Psi^T$ that maps the initial state $s$ and the $T$ inputs $w_1, w_2, \ldots, w_T$ to the $T$ outputs $y_1, y_2, \ldots, y_T$. It can also compute any other function obtained by ignoring some outputs or fixing either the initial state or some inputs or both.

The class of languages recognized by finite-state machines (the regular languages) is not rich enough to describe easily the important programming languages that are in use today. As a consequence, other languages, such as the context-free languages, are employed. Context-free languages (which include the regular languages) require computers with potentially unbounded storage for their recognition. The class of computers that recognizes exactly the context-free languages is the class of nondeterministic pushdown automata, pushdown automata in which the control unit is nondeterministic; that is, some of its states can have multiple potential successor states.

The strings in regular and context-free languages (and other languages as well) can be generated by grammars. A context-free grammar $G = (\mathcal{N}, \mathcal{T}, \mathcal{R}, S)$ consists of sets of terminal and non-terminal symbols, $\mathcal{T}$ and $\mathcal{N}$ respectively, and rules $\mathcal{R}$ by which each non-terminal is replaced with one or more strings of terminals and non-terminals. All string generations start with the special start non-terminal $S$. The language generated by $G$, $L(G)$, contains the strings of terminal characters produced by rewriting strings in this fashion. This is illustrated by the context-free grammar $G$ with two rules shown below.

EXAMPLE 1.4.1 $G = (\mathcal{N}, \mathcal{T}, \mathcal{R}, S)$, where $\mathcal{N} = \{S\}$, $\mathcal{T} = \{a, b\}$, and $\mathcal{R}$ consists of the two rules

(a) $S \rightarrow aSb$    (b) $S \rightarrow ab$

Each application of a rule derives another string, as shown below. This grammar has only two derivations, namely $S \rightarrow aSb$ and $S \rightarrow ab$. The second derivation is always the last to be used. (Recall that the language $L(G)$ contains only terminal strings.)

$$\begin{aligned}
S &\rightarrow aSb \\
  &\rightarrow aaSbb \\
  &\rightarrow aaaSbbb \\
  &\rightarrow aaaabbbb
\end{aligned}$$

As can be seen by inspection, the only strings in $L(G)$ are of the form $a^k b^k$, where $a^k$ denotes the letter $a$ repeated $k$ times. Thus, $L(G) = \{a^k b^k \mid k \ge 1\}$.
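
The derivation in the example can be reproduced mechanically. The sketch below, in which the function name `derive` is hypothetical, applies rule (a) $k - 1$ times and then rule (b) once, recording every intermediate string.

```python
def derive(k: int) -> list[str]:
    """Return the successive strings in the derivation of a^k b^k, k >= 1."""
    if k < 1:
        raise ValueError("k must be at least 1")
    forms = ["S"]
    for _ in range(k - 1):
        forms.append(forms[-1].replace("S", "aSb"))  # rule (a): S -> aSb
    forms.append(forms[-1].replace("S", "ab"))       # rule (b): S -> ab
    return forms


steps = derive(4)
```

For $k = 4$ the list `steps` contains the start symbol followed by the four strings of the derivation shown in the example, ending with the terminal string.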

Once  a  grammar  for  a  regular  or  context-free  language  is  known,  it  is  possible  to  parse  a 
string  in  the  language.  In  the  above  example  this  amounts  to  determining  the  number  of  times 
that  the  first  rule  is  applied. 

To develop some intuition for the use of the pushdown automaton as a recognizer for context-free languages, observe that we can determine the number of applications of the first rule in this language by pushing each instance of $a$ onto a stack and then popping $a$'s as $b$'s are encountered. The number of $a$'s can then be matched with the number of $b$'s and if they are not equal, the string is declared not in the language. If equal, the number of instances of the first rule is determined.
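
The stack-based matching just described can be sketched as follows. This is an informal illustration rather than a formal pushdown automaton; the function name `count_first_rule` is hypothetical, and the extra flag that rejects an $a$ appearing after a $b$ plays the role of the control unit's state.

```python
from typing import Optional


def count_first_rule(string: str) -> Optional[int]:
    """Return the number of uses of rule (a) if string is a^k b^k, else None."""
    stack = []
    seen_b = False
    matched = 0
    for ch in string:
        if ch == "a":
            if seen_b:               # an a after a b cannot be matched
                return None
            stack.append(ch)         # push each a
        elif ch == "b":
            seen_b = True
            if not stack:            # more b's than a's
                return None
            stack.pop()              # pop one a per b
            matched += 1
        else:
            return None              # not a terminal of the grammar
    if stack or matched == 0:        # more a's than b's, or empty string
        return None
    return matched - 1               # rule (b) accounts for the last pair


uses = count_first_rule("aaaabbbb")
```

Because every accepted string ends with one application of rule (b), a string with $k$ matched pairs used rule (a) exactly $k - 1$ times.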

Programming  languages  contain  strings  of  characters  and  digits  representing  names  and 
the  values  of  variables.  Such  strings  can  typically  be  scanned  with  finite-state  machines.  Once 
scanned,  these  strings  can  be  assigned  tokens  that  are  then  used  in  a  later  parsing  phase,  which 
today  is  typically  based  on  a  generalization  of  parsing  for  context-free  languages. 

1.5  Computational  Complexity 

Computational  complexity  is  examined  in  concrete  and  abstract  terms.  The  concrete  analysis 
of  computational  limits  is  done  using  models  that  capture  the  exchange  of  space  for  time.  It  also 
is  done  via  the  study  of  circuit  complexity,  the  minimal  size  and  depth  of  circuits  for  functions. 
Computational  complexity  is  studied  abstractly  via  complexity  classes,  the  classification  of 
languages  by  the  time  and/or  space  they  need. 

1.5.1  A  Computational  Inequality 

Computational inequalities play an important role in this book. We now sketch the derivation of a computational inequality for the finite-state machine and specialize it to the RAM. The idea is very simple: we simulate with a circuit the computation of a function $f$ by an FSM and then compare the size of the circuit produced with the size of the smallest circuit for $f$. Simulation, which we use to derive this result, is a central idea in theoretical computer science. For example, it is used to show that a problem is NP-complete. We use it here to relate the resources available to compute a function $f$ with an FSM to the inherent complexity of $f$.

Shown in Fig. 1.5(a) is the standard model for an FSM. As suggested, a circuit $L$ combines the current state held in the memory $M$ together with an external input to form an external output and a successor state which is held in $M$. If the input, output, and state are represented as binary tuples, the circuit $L$ can be realized by a logic circuit with Boolean gates. Let the FSM compute the function $f : \mathcal{B}^n \mapsto \mathcal{B}^m$ in $T$ steps; that is, its state and/or $T$ external inputs contain the $n$ Boolean inputs to $f$ and its $T$ outputs contain the $m$ Boolean outputs of $f$. (The inputs and outputs must appear in the same positions on each computation to prevent the application of hidden computational resources.)

Figure 1.9 A circuit that computes the same function as an FSM (see Fig. 1.5(a)) in $T$ steps. It has the same initial state $s$, receives the same inputs and produces the same outputs.

The function $f$ can also be computed by the circuit shown in Fig. 1.9, which is obtained by unwinding the loop of Fig. 1.5(a) using $T$ copies of the logic circuit $L$ for the FSM. This follows because the inputs $x_1, x_2, \ldots, x_T$ that would be given to the FSM over time can be given simultaneously to this circuit and it will produce the $T$ outputs that would be produced by the FSM. This circuit has $T \cdot C(L)$ gates, where $C(L)$ is the actual or equivalent number of gates used to realize $L$. (The circuit $L$ may be realized with a technology that does not formally use gates.) Since this circuit is not necessarily the smallest circuit for the function $f$, we have the following inequality, where $C(f)$ is the size of the smallest circuit for $f$:

$$C(f) \le T \cdot C(L)$$

This result is important because it imposes a constraint on every computation done by a sequential machine. This inequality has two interpretations. First, if the product $T \cdot C(L)$ of the number of time steps $T$ and the equivalent number of logic operations $C(L)$ per step (that is, the equivalent number of logic operations employed) is too small, namely, less than $C(f)$, the FSM cannot compute the function $f$ because the above inequality would be violated. This is a form of impossibility theorem for bounded computations. Second, a complex function, one for which $C(f)$ is large, requires a large value for the product $T \cdot C(L)$. In light of the first interpretation of $T \cdot C(L)$ as the equivalent number of logic operations employed, it makes sense to call $W = T \cdot C(L)$ the computational work done by the FSM to compute $f$.
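
As a purely illustrative instance of the first interpretation, with numbers that are assumed rather than taken from the text, suppose the logic circuit of an FSM has $C(L) = 10^3$ equivalent gates and the function to be computed has circuit size $C(f) = 10^6$. Rearranging the inequality $C(f) \le T \cdot C(L)$ gives a lower bound on the number of steps:

$$T \ge \frac{C(f)}{C(L)} = \frac{10^6}{10^3} = 10^3$$

Under these assumptions no computation of $f$ by this FSM can finish in fewer than 1000 steps, whatever algorithm its logic circuit implements.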

The above computational inequality can be specialized to the bounded-memory RAM with $S$ bits of memory. When $S$ is large, as it usually is, $C(L)$ for the RAM is proportional to $S$. As a consequence, for the RAM we have the following computational inequality for some positive constant $\kappa$:

$$C(f) \le \kappa S T$$

This  inequality  shows  the  central  role  of  circuit  complexity  in  theoretical  computer  science.  It 
also  demonstrates  that  the  space-time  product,  ST,  is  an  important  measure  of  the  complexity 
of  a  problem.  Functions  with  large  circuit  size  can  be  computed  by  a  RAM  only  if  it  either  has 
a  large  storage  capacity  or  executes  many  time  steps  or  both.  Similar  results  exist  for  the  Turing 
machine. 


1.5.2  Tradeoffs  in  Space,  Time,  and  I/O  Operations 

Computational  inequalities  of  the  kind  sketched  above  are  important  but  often  difficult  to 
apply  because  it  is  hard  to  show  that  functions  have  a  large  circuit  size.  For  this  reason  space- 
time  tradeoffs  have  been  studied  under  the  assumption  that  the  type  of  algorithm  or  program 
allowed  is  restricted.  For  example,  if  only  straight-line  programs  are  considered,  then  the  pebble 
game  sketched  below  and  discussed  in  detail  in  Chapter  10  can  be  used  to  derive  tradeoff 
inequalities. 

The  standard  pebble  game  is  played  on  a  directed  acyclic  graph  (DAG),  the  graph  of  a 
straight-line  program.  The  input  vertices  of  a  DAG  have  no  edges  directed  into  them.  Output 
vertices  have  no  edges  directed  away  from  them.  Internal  vertices  are  non-input  vertices.  A 
predecessor  of  a  vertex  v  is  a  vertex  u  that  has  an  edge  directed  to  v.  The  pebble  game  is  played 
with  pebbles  that  are  placed  on  vertices  according  to  the  following  rules: 

•  Initially  no  vertices  carry  pebbles. 

•  A pebble can be placed on an input vertex at any time.

•  A pebble can be placed on an internal vertex only if all of its predecessor vertices carry pebbles.

•  The  pebble  moved  to  a  vertex  can  be  a  pebble  residing  on  one  of  its  immediate  predecessors. 

•  A  pebble  can  be  removed  from  a  vertex  at  any  time. 

•  Every  output  vertex  must  carry  a  pebble  at  some  time. 

Space  S  in  this  game  is  the  maximum  number  of  pebbles  used  to  play  the  game  on  a 
DAG.  Time  T  is  the  number  of  times  that  pebbles  are  placed  on  vertices.  If  enough  pebbles 
are  available  to  play  the  game,  each  vertex  is  pebbled  once  and  T  is  the  number  of  vertices  in 
the  graph.  If,  however,  there  are  not  enough  pebbles,  some  vertices  will  have  to  be  pebbled 
more  than  once.  In  this  case  a  tradeoff  between  space  and  time  will  be  exhibited. 
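
The rules and the two measures can be made concrete with a small checker that replays a pebbling strategy. This is a minimal sketch: the move encoding (`"place"`, `"slide"`, `"remove"`), the function name `play_pebble_game`, and the four-vertex example graph are all assumptions made for illustration.

```python
def play_pebble_game(preds, outputs, moves):
    """Replay a pebbling strategy; return (space S, time T) or raise ValueError.

    preds   : dict mapping every vertex to the list of its predecessors
              (an empty list marks an input vertex)
    outputs : set of output vertices
    moves   : list of ("place", v), ("slide", u, v), or ("remove", v)
    """
    pebbled = set()                        # initially no vertex carries a pebble
    pebbled_outputs = set()
    space = time = 0
    for move in moves:
        if move[0] == "remove":
            pebbled.discard(move[1])       # removal is allowed at any time
            continue
        v = move[-1]
        if any(u not in pebbled for u in preds[v]):
            raise ValueError(f"predecessors of {v!r} are not all pebbled")
        if move[0] == "slide":             # reuse a pebble from a predecessor
            if move[1] not in preds[v]:
                raise ValueError(f"{move[1]!r} is not a predecessor of {v!r}")
            pebbled.remove(move[1])
        pebbled.add(v)
        time += 1                          # T counts placements
        space = max(space, len(pebbled))   # S is the peak number of pebbles
        if v in outputs:
            pebbled_outputs.add(v)
    if pebbled_outputs != set(outputs):
        raise ValueError("some output vertex was never pebbled")
    return space, time


# Hypothetical DAG: inputs x and y, internal vertex z, output vertex w.
dag = {"x": [], "y": [], "z": ["x", "y"], "w": ["z", "y"]}
thrifty = [("place", "x"), ("place", "y"), ("slide", "x", "z"), ("slide", "z", "w")]
lavish = [("place", v) for v in ("x", "y", "z", "w")]
```

On this graph both strategies pebble each vertex once, but the strategy that slides pebbles needs only two pebbles while the other needs four.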

For a particular DAG $G$ we may seek to determine the minimum number of pebbles, $S_{\min}$, needed to place pebbles on all output vertices at some time and for a given number of pebbles $S$ to determine the minimum time $T$ needed when $S$ pebbles are used. Methods for computing $S_{\min}$ and bounding $S$ and $T$ simultaneously have been developed. For example, the four-point (four-input) fast Fourier transform (FFT) graph shown in Fig. 1.10 has $S_{\min} = 3$ and can be pebbled in the minimum number of steps with five pebbles.

Let the FFT graph of Fig. 1.10 be pebbled with the minimum number $S$ of pebbles. Initially no pebbles reside on the graph. Thus, there is a first point in time at which $S$ pebbles reside on the graph. The dark gray vertices identify one possible placement of pebbles at such a point in time. The light gray vertices will have had pebbles placed on them prior to this time and will have to be repebbled later to pebble output vertices that cannot be reached from the placement of the dark gray vertices. This demonstrates that for this graph if the minimum number of pebbles is used, some vertices will have to be repebbled. Although the $n$-point FFT graph, $n$ a power of two, has only $n \log n + n$ vertices, we show in Section 10.5.5 that its vertices must be repebbled enough times that $S$ and $T$ satisfy $(S + 1)T \ge n^2/16$. Thus, either $S$ is much larger than the minimum space or $T$ is much larger than the number of vertices or both.


Figure 1.10 A pebbling of a four-input FFT graph at the point at which the maximum number of pebbles (three) is used. Numbers specify the order in which vertices can be pebbled. Some vertices are pebbled twice.

Space-time tradeoffs can also be studied with the branching program, a type of program that permits data-dependent computations. (See Section 10.9.) While branching programs provide more flexibility than does the pebble game, they are worth considering only for problems in which the algorithms used involve branching and have access to an external random-access memory to permit data-dependent reading of inputs, a strong assumption. For many problems only straight-line programs are used, in which case the pebble game is the model of choice.

A  serious  problem  arises  when  the  storage  capacity  of  a  primary  memory  is  too  small  for 
a  problem,  so  that  a  slow  secondary  memory,  such  as  a  disk,  must  be  used  for  intermediate 
storage.  This  results  in  time-consuming  input/output  operations  (I/O)  between  primary  and 
secondary  memory.  If  too  many  I/O  operations  are  done,  the  overall  performance  of  the  system 
can  deteriorate  markedly.  This  problem  has  been  exacerbated  by  the  growing  disparity  between 
the  speed  of  CPUs  and  that  of  memories;  the  speed  of  CPUs  is  increasing  over  time  at  a  greater 
rate  than  that  of  memories.  In  fact,  the  latency  of  a  disk,  the  time  between  the  issuance  of  a 
request  for  data  and  the  time  it  is  answered,  can  be  100,000  to  1,000,000  times  the  length  of  a 
CPU  cycle.  As  a  consequence,  the  amount  of  time  spent  swapping  data  between  primary  and 
secondary  memory  may  dominate  the  time  to  perform  computations.  A  second  pebble  game, 
the  red-blue  pebble  game,  has  been  introduced  to  study  this  problem.  (See  Chapter  11.) 

The  red-blue  pebble  game  is  played  with  both  red  and  blue  pebbles.  The  (hot)  red  pebbles 
correspond  to  primary  memory  locations  and  the  (cool)  blue  pebbles  correspond  to  secondary 
memory  locations.  Red  pebbles  are  played  according  to  the  rules  of  the  above  pebble  game. 
The  additional  rules  that  apply  to  the  red  and  blue  pebbles  allow  a  red  pebble  to  be  swapped 
for  a  blue  one  and  vice  versa.  In  addition,  blue  pebbles  reside  only  on  inputs  initially  and 
must  reside  on  outputs  finally.  The  number  of  red  pebbles  is  limited,  but  the  number  of  blue 
pebbles  is  not. 

The goal of the red-blue pebble game is to minimize the number of times that red and blue pebbles are swapped, since each swap corresponds to an expensive input/output (I/O) operation. Let $T$ be the number of I/O operations and $S$ be the number of red pebbles. Upper and lower bounds on the exchange of $S$ for $T$ have been derived for a large number of problems. For example, for the problem of multiplying two $n \times n$ matrices in about $2n^3$ steps with the classical algorithm, it has been shown that a red-blue pebble-game strategy leads to a product $ST^2$ proportional to $n^6$ and that this cannot be beaten except by a small multiplicative factor.
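
To see what this tradeoff says about I/O, the stated proportionality can be solved for $T$. The constant $c > 0$ below is an assumed symbol standing for the unspecified constant of proportionality:

$$S T^2 = c\, n^6 \quad\Longrightarrow\quad T = \sqrt{c}\, \frac{n^3}{\sqrt{S}}$$

Read this way, the number of I/O operations falls only as the square root of the primary memory size: quadrupling the number of red pebbles halves $T$.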


1.5.3  Complexity  Classes 

Complexity  classes  provide  a  way  to  group  languages  of  similar  computational  complexity.  For 
example,  the  nondeterministic  polynomial-time  languages  (NP)  are  languages  that  can  be 
solved  in  time  that  is  polynomial  in  the  size  of  their  input  when  the  machine  in  question  is 
a  nondeterministic  Turing  machine  (TM).  Nondeterministic  Turing  machines  can  have  more 
than  one  state  that  is  a  successor  to  the  current  state  for  the  current  input.  Thus,  they  can 
make  choices  between  successor  states.  A  language  L  is  in  NP  if  there  is  a  nondeterministic 
TM  such  that,  given  an  arbitrary  string  in  L,  there  is  some  choice  of  successor  states  for  the 
TM  control  unit  that  causes  the  TM  to  enter  an  accepting  state  in  a  number  of  steps  that  is 
polynomial  in  the  length  of  the  input. 

An NP-complete language $L_0$ must satisfy two conditions. First, $L_0$ must be in NP and second, it must be true that for each language $L$ in NP a string $x$ can be translated into a string $y$ using an algorithm whose running time is a polynomial in the length of $x$ such that $y$ is in $L_0$ if and only if $x$ is in $L$. As a consequence of this definition, if any NP-complete language can be recognized in deterministic polynomial time, then every language in NP can, including all the other NP-complete languages. However, the best algorithms known today for NP-complete languages all have exponential running time. Thus, for long strings these algorithms are impractical. If solutions to large instances of NP-complete problems are needed, we are limited to approximate solutions.
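
The gap between making one lucky sequence of choices and trying all of them can be illustrated with Boolean satisfiability, a standard NP-complete problem that is used here only as an example and is not discussed in this section. In the sketch below the clause encoding and the names `satisfies` and `brute_force_sat` are assumptions: checking a single assignment takes time linear in the formula, while the deterministic search may examine all $2^n$ assignments of $n$ variables.

```python
from itertools import product
from typing import Dict, List, Optional

# A clause is a list of literals; literal k means variable k is true,
# and literal -k means variable k is false.
Clause = List[int]


def satisfies(clauses: List[Clause], assignment: Dict[int, bool]) -> bool:
    """Check one assignment: every clause must contain a true literal."""
    return all(
        any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in clauses
    )


def brute_force_sat(clauses: List[Clause], num_vars: int) -> Optional[Dict[int, bool]]:
    """Try all 2**num_vars assignments; return a satisfying one or None."""
    for values in product((False, True), repeat=num_vars):
        assignment = dict(zip(range(1, num_vars + 1), values))
        if satisfies(clauses, assignment):
            return assignment
    return None


solution = brute_force_sat([[1, 2], [-1, 2], [-2, 3]], num_vars=3)
unsolvable = brute_force_sat([[1], [-1]], num_vars=1)
```

A nondeterministic machine corresponds to guessing the assignment and running only `satisfies`; the loop in `brute_force_sat` is what a straightforward deterministic simulation of those guesses costs.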

1.5.4  Circuit  Complexity 

Circuit  complexity  is  a  notoriously  difficult  subject.  Despite  decades  of  research,  we  have 
failed  to  find  methods  to  show  that  individual  functions  have  super-polynomial  circuit  size 
or  more  than  poly-logarithmic  depth.  Nonetheless,  the  circuit  is  such  a  simple  and  appealing 
model  that  it  continues  to  attract  a  considerable  amount  of  attention.  Some  very  interesting 
exponential  lower  bounds  on  circuit  size  have  been  derived  when  the  circuits  are  monotone, 
that  is,  realized  by  AND  and  OR  gates  but  no  NOTs. 

1.6  Parallel  Computation 

The  VLSI  machine  and  the  PRAM  are  examples  of  parallel  machines.  The  VLSI  machine 
reflects  constraints  that  exist  when  finite-state  machines  are  realized  through  the  very  large- 
scale  integration  of  components  on  semiconductor  chips.  In  the  VLSI  model  the  area  of  a  chip 
is  important  because  large  chips  have  a  much  higher  probability  of  containing  a  disabling  defect 
than  smaller  ones.  Consequently,  the  absolute  size  of  chips  is  limited.  However,  the  width  of 
lines  that  can  be  drawn  on  chips  has  been  shrinking  over  time,  thereby  increasing  the  number 
of  wires,  gates,  and  binary  memory  cells  that  can  be  placed  on  them.  This  has  the  effect  of 
increasing  the  effective  chip  area,  the  real  chip  area  normalized  by  the  cross  section  of  wires. 

Figure 1.11(a) is a VLSI diagram representing the types of material that can be deposited on the surface of a pure crystalline semiconductor substrate to form different types of conducting regions. Some of the rectangular regions serve as wires whereas overlaps of other regions create transistors. In turn, collections of transistors form gates. This VLSI diagram describes a NAND gate, a gate whose Boolean function is the NOT of the AND of its two inputs. Shown in Fig. 1.11(b) is the logic symbol for the NAND gate. The small circle at the output of the AND gate denotes the NOT of the gate value.

Given the premium attached to chip real estate, a large number of economical and very regular finite-state machine designs have been made for VLSI chips. One of the most important of these is the systolic array, a one- or two-dimensional array of processors (FSMs) that are identical, except possibly for those along the periphery of the array. These processors operate in synchrony; that is, they perform the same operation at the same time. They also communicate only with their nearest neighbors. (The word “systolic” is derived from “systole,” a “rhythmically recurrent contraction” such as that of the heart.)

Figure 1.11 (a) A layout diagram for a VLSI chip and (b) its logic symbol.

Figure 1.12 A systolic array for the convolution of two binary sequences.

Systolic arrays are typically used to compute specific functions such as the convolution $c = a \otimes b$ of the $n$-tuple $a = (a_0, a_1, \ldots, a_{n-1})$ with the $m$-tuple $b = (b_0, b_1, \ldots, b_{m-1})$. The $j$th component, $c_j$, of the convolution $c = a \otimes b$, $0 \le j \le n + m - 2$, is defined as

$$c_j = \sum_{r+s=j} a_r * b_s$$

It is assumed that the components of $a$ and $b$ are drawn from a set over which the operations of $*$ (multiplication) and $+$ (addition) are defined, such as the integers.
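
The definition of the convolution can be checked directly with a short serial sketch over the integers. It computes each component from the defining sum and does not model the systolic array itself; the function name `convolve` and the sample vectors are illustrative assumptions.

```python
def convolve(a: list[int], b: list[int]) -> list[int]:
    """Return c with c[j] equal to the sum of a[r] * b[s] over all r + s == j."""
    n, m = len(a), len(b)
    c = [0] * (n + m - 1)          # components c_0 through c_{n+m-2}
    for r in range(n):
        for s in range(m):
            c[r + s] += a[r] * b[s]
    return c


result = convolve([1, 2, 3], [4, 5, 6])
```

For two 3-tuples the result has five components, matching the range of the index $j$ in the definition.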

Shown schematically in Fig. 1.12 on page 28 is the one-dimensional systolic array for the
convolution $c = a \otimes b$ at the second, fourth, fifth, and sixth steps of execution on input
vectors $a = (a_0, a_1, a_2)$ and $b = (b_0, b_1, b_2)$. The components of these vectors are fed from
the left and right, respectively, spaced by zero elements. The first component of a enters the
array one step ahead of the first component of b. The result of the convolution is the vector
$c = (c_0, c_1, c_2, c_3, c_4)$. There is one more cell in the array than there are components in the
result. At each step the components of a and b in each cell are multiplied and added to the
previous value of the component of c in that cell. After all components of the two input vectors
pass through the cell, the convolution is computed.
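
As a point of comparison, the definition of the convolution can be evaluated directly by a sequential program. The sketch below is not a simulation of the systolic array; it only applies the defining sum over all pairs with $r + s = j$. The function name `convolve` and the sample inputs are illustrative assumptions, not taken from the text.

```python
def convolve(a, b):
    """Return c with c[j] = sum of a[r] * b[s] over all r + s = j."""
    n, m = len(a), len(b)
    c = [0] * (n + m - 1)          # components c_0 .. c_{n+m-2}
    for r in range(n):
        for s in range(m):
            c[r + s] += a[r] * b[s]
    return c


# Illustrative integer inputs of length three, as in the example above.
print(convolve([1, 2, 3], [4, 5, 6]))   # [4, 13, 28, 27, 18]
```

Two inputs of length three give a result with five components, which matches the vector $(c_0, \ldots, c_4)$ described above.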

The processors of a parallel computer generally do not communicate only with nearest
neighbors, as in the systolic array. Instead, processors often can communicate with remote
neighbors via a network. The type of network chosen for a parallel computer can have a large
impact on its effectiveness.

The processors of the PRAM mentioned in Section 1.1 operate synchronously, alternating
between accessing a global memory and computing locally. Since the processors communicate
by writing and reading values to and from the global memory, all processors are at the same
distance from one another. Although the PRAM model makes two unrealistic assumptions,
namely that processors a) can act in synchrony and b) can communicate directly via global
memory, it remains a good model in which to explore problems that are hard to parallelize,
even with the flexibility offered by this model.


Problems 

MATHEMATICAL  PRELIMINARIES 

1.1 Show that the sum $S(k)$ below has value $S(k) = 2^k - 1$:

$$S(k) = 2^{k-1} + 2^{k-2} + \cdots + 2^1 + 2^0$$

SETS,  LANGUAGES,  INTEGERS,  AND  GRAPHS 

1.2 Let A = {red, green, blue}, B = {green, violet}, and C = {red, yellow, blue, green}.
Determine the elements in $(A \cap C) \times (B - C)$.

1.3 Let the relation $R \subseteq \mathbb{N} \times \mathbb{N}$ be defined by pairs $(a, b)$ such that a and b have the same
remainder on division by 3. Show that R is an equivalence relation.

1.4 Let $R \subseteq A \times A$ be an equivalence relation. Let the set $E[a]$ be the elements in A
equivalent under the relation R to the element a. Show that for all $a, b \in A$ the
equivalence classes $E[a]$ and $E[b]$ are either equal or disjoint. Also show that A is the
union of all equivalence classes.

1.5  In  terms  of  the  Kleene  closure  and  the  concatenation  of  sets,  describe  the  languages 
containing  the  following: 

a)  Strings  over  {0,  1}  beginning  with  01. 

b) Strings beginning with 0 that alternate between 0 and 1.

1.6 Describe an algorithm to convert numbers from decimal to binary notation.

1.7 A graph G = (V, E) can be described by adjacency lists, one list for each vertex in the
graph. The adjacency list for vertex $v \in V$ is a list of vertices to which there is an edge
from v. Generate adjacency lists for the two graphs of Fig. 1.2.

TASKS  AS  FUNCTIONS 

1.8 Let $\mathbb{Z}_5$ be the set {0, 1, 2, 3, 4}. Let the addition operator $\oplus$ over this set be modulo 5;
that is, if x and y are two such integers, $x \oplus y$ is obtained by adding x and y as integers
and taking the remainder after division by 5. For example, $2 \oplus 2 = 4 \bmod 5$ whereas
$3 \oplus 4 = 7 = 2 \bmod 5$. Provide a table describing the function $\oplus : \mathbb{Z}_5 \times \mathbb{Z}_5 \mapsto \mathbb{Z}_5$.

1.9 Give a truth table for the Boolean function whose value is True exactly when either x
or y or both is True and z is False.


RATE  OF  GROWTH  OF  FUNCTIONS 

1.10 For each of the fifteen unordered pairs of functions f and g below, determine whether
$f(n) = O(g(n))$, $f(n) = \Omega(g(n))$, or $f(n) = \Theta(g(n))$.

a) $n^3$; c) $n^6$; e) $n^3 \log_2 n$;

b) $2^{n \log_2 n}$; d) $n 2^n$; f) $2^{2^n}$.

1.11 Show that $2.7n^2 + 6\sqrt{n}\,\lceil \log_2 n \rceil < 8.7n^2$ for $n > 3$.


METHODS  OF  PROOF 


1.12 Let $S_r(n) = \sum_{j=1}^{n} j^r$ denote a sum of powers of integers. Use proof by induction to
show that the following identities on arithmetic series hold:

a) $S_2(n) = \frac{n^3}{3} + \frac{n^2}{2} + \frac{n}{6}$

b) $S_3(n) = \frac{n^4}{4} + \frac{n^3}{2} + \frac{n^2}{4}$


COMPUTATIONAL  MODELS 

1.13 Produce a circuit and straight-line program for the Boolean function described in
Problem 1.9.

1.14  A  state  diagram  for  a  finite-state  machine  is  a  graph  containing  one  vertex  (or  state) 
for  each  pattern  of  data  that  can  be  held  in  its  memory  and  an  edge  from  state  p  to 
state  q  if  there  is  a  value  for  the  input  data  that  causes  the  memory  to  change  from  p 
to  q.  Such  an  edge  is  labeled  with  the  value  of  the  input  data  that  causes  the  transition. 
Outputs  are  generated  by  a  finite-state  machine  when  it  is  in  a  state.  The  vertices  of  its 
state  diagram  are  labeled  by  these  outputs. 

Provide  a  state  diagram  for  the  finite-state  machine  described  in  Fig.  1.5(b). 

1.15  Using  the  straight-line  program  given  for  the  full-adder  circuit  in  Section  1.4.1,  describe 
how  such  a  program  would  be  placed  in  the  random-access  memory  of  the  RAM  and 
how  the  RAM  would  run  the  fetch-and-execute  cycle  to  compute  the  values  produced 
by the full-adder circuit. This is an example of circuit simulation by a program.

1.16 Describe the actions that could be taken by a Turing machine to simulate a circuit from
a  straight-line  program  for  it.  Illustrate  your  approach  by  applying  it  to  the  simulation 
of  the  full-adder  circuit  described  in  Section  1.4.1. 

1.17  Suppose  you  are  told  that  a  function  is  computed  in  four  time  steps  by  a  very  simple 
finite-state  machine,  one  whose  logic  circuit  (but  not  its  memory)  can  be  realized  with 
four  logic  gates.  Suppose  you  are  also  told  that  the  same  function  cannot  be  computed 
by  a  logic  circuit  with  fewer  than  20  logic  gates.  What  can  be  said  about  these  two 
statements?  Explain  your  answer. 

1.18  Describe  a  finite-state  machine  that  recognizes  the  language  consisting  of  those  strings 
over  {0,  1}  that  end  in  1. 

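1.19 Determine the language generated by the context-free grammar $G = (\mathcal{N}, \mathcal{T}, \mathcal{R}, S)$
where $\mathcal{N} = \{S, M, N\}$, $\mathcal{T} = \{a, b, c, d\}$ and $\mathcal{R}$ consists of the rules given below.

a) $S \rightarrow MN$        d) $N \rightarrow cNd$
b) $M \rightarrow aMb$       e) $N \rightarrow cd$
c) $M \rightarrow ab$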

COMPUTATIONAL  COMPLEXITY 

1.20 Using the rules for the red pebble game, show how to pebble the FFT graph of Fig. 1.10
with five red pebbles by labeling each vertex with the time step on which it is pebbled.
If a vertex has to be repebbled, it will be pebbled on two time steps.

1.21 Suppose that you are told that the $n$-point FFT graph can be pebbled with $\sqrt{n}$ pebbles
in $n/4$ time steps for $n > 37$. What can you say about this statement?

1.22  You  have  been  told  that  the  FFT  graph  of  Fig.  1.10  cannot  be  pebbled  with  fewer  than 
three  red  pebbles.  Show  that  it  can  be  pebbled  with  two  red  pebbles  in  the  red-blue 
pebble  game  by  sketching  how  you  would  use  blue  pebbles  to  achieve  this  objective. 

PARALLEL  COMPUTATION 

1.23  Using  Fig.  1.12  as  a  guide,  design  a  systolic  array  to  convolve  two  sequences  of  length 
two.  Sketch  out  each  step  of  the  convolution  process. 

1.24 Consider a version of the PRAM consisting of a collection of RAMs (see Fig. 1.13) with
small  local  random-access  memories  that  repeat  the  following  three-step  cycle  until  they 
halt:  a)  they  simultaneously  read  one  word  from  a  common  global  memory,  b)  they 
execute  one  local  instruction  using  local  memory,  and  c)  they  write  one  word  to  the 
common  memory.  When  reading  and  writing,  the  individual  processors  are  allowed 
to  read  and  write  from  the  same  location.  If  two  RAMs  write  to  the  same  location, 
they  must  be  programmed  so  that  they  write  a  common  value.  (This  is  known  as  the 
concurrent-read,  concurrent-write  (CRCW)  PRAM.)  Each  RAM  has  a  unique  integer 
associated  with  it  and  can  use  this  number  to  decide  where  to  read  or  write  in  the 
common  memory. 

Show  that  the  CRCW  PRAM  can  compute  the  AND  of  n  Boolean  variables  in  two 
cycles. 

Hint: Reserve one word in common memory and initialize it with 0 and assign RAMs
to the appropriate memory cells.

Figure 1.13 The PRAM model is a collection of synchronous RAMs accessing a common
memory.

Chapter  Notes 

Since this chapter introduces concepts used elsewhere in the book, we postpone the
bibliographic citations to later chapters. We remark here, however, that the notation for the rate of
growth of functions in Section 1.2.8 is due to Knuth [171]. The reader interested in more
information on the development of the digital computer, ranging from Babbage’s seminal work
in the 1830s to the pioneering work of the 1940s, should consult the collection of papers
selected and edited by Brian Randell [268].

Logic Circuits


Many important functions are naturally computed with straight-line programs, programs
without loops or branches. Such computations are conveniently described with circuits,
directed acyclic graphs of straight-line programs. Circuit vertices are associated with program
steps, whereas edges identify dependencies between steps. Circuits are characterized by their
size, the number of vertices, and their depth, the length (in edges) of their longest path.
Circuits in which the operations are Boolean are called logic circuits, those using algebraic
operations are called algebraic circuits, and those using comparison operators are called
comparator circuits. In this chapter we examine logic circuits. Algebraic and comparator circuits
are examined in Chapter 6.

Logic circuits are the basic building blocks of real-world computers. As shown in
Chapter 3, all machines with bounded memory can be constructed of logic circuits and binary
memory units. Furthermore, machines whose computations terminate can be completely
simulated by circuits.

In  this  chapter  circuits  are  designed  for  a  large  number  of  important  functions.  We  begin 
with  a  discussion  of  circuits,  straight-line  programs,  and  the  functions  computed  by  them. 
Normal  forms,  a  structured  type  of  circuit,  are  examined  next.  They  are  a  starting  point  for 
the  design  of  circuits  that  compute  functions.  We  then  develop  simple  circuits  that  combine 
and select data. They include logical circuits, encoders, decoders, multiplexers, and
demultiplexers. This is followed by an introduction to prefix circuits that efficiently perform running
sums.  Circuits  are  then  designed  for  the  arithmetic  operations  of  addition  (in  which  prefix 
computations  are  used),  subtraction,  multiplication,  and  division.  We  also  construct  efficient 
circuits  for  symmetric  functions.  We  close  with  proofs  that  every  Boolean  function  can  be 
realized  with  size  and  depth  exponential  and  linear,  respectively,  in  its  number  of  inputs,  and 
that  most  Boolean  functions  require  such  circuits. 

The  concept  of  a  reduction  from  one  problem  to  a  previously  solved  one  is  introduced  in 
this  chapter  and  applied  to  many  simple  functions.  This  important  idea  is  used  later  to  show 
that  two  problems,  such  as  different  NP-complete  problems,  have  the  same  computational 
complexity. (See Chapters 3 and 8.)

2.1 Designing Circuits

The logic circuit, as defined in Section 1.4.1, is a directed acyclic graph (DAG) whose vertices
are labeled with the names of Boolean functions (logic gates) or variables (inputs). Each logic
circuit computes a binary function $f : B^n \mapsto B^m$ that is a mapping from the values of its n
input variables to the values of its m outputs.

Computer  architects  often  need  to  design  circuits  for  functions,  a  task  that  we  explore  in 
this  chapter.  The  goal  of  the  architect  is  to  design  efficient  circuits,  circuits  whose  size  (the 
number  of  gates)  and/or  depth  (the  length  of  the  longest  path  from  an  input  to  an  output 
vertex)  is  small.  The  computer  scientist  is  interested  in  circuit  size  and  depth  because  these 
measures  provide  lower  bounds  on  the  resources  needed  to  complete  a  task.  (See  Section  1.5.1 
and  Chapter  3.)  For  example,  circuit  size  provides  a  lower  bound  on  the  product  of  the 
space  and  time  needed  for  a  problem  on  both  the  random-access  and  Turing  machines  (see 
Sections  3.6  and  3.9.2)  and  circuit  depth  is  a  measure  of  the  parallel  time  needed  to  compute 
a  function  (see  Section  8.14.1). 

The logic circuit also provides a framework for the classification of problems by their
computational complexity. For example, in Section 3.9.4 we use circuits to identify hard
computational problems, in particular, the P-complete languages that are believed hard to parallelize
and the NP-complete languages that are believed hard to solve on serial computers. After more
than fifty years of research it is still unknown whether NP-complete problems have
polynomial-time algorithms.

In  this  chapter  not  only  do  we  describe  circuits  for  important  functions,  but  we  show  that 
most  Boolean  functions  are  complex.  For  example,  we  show  that  there  are  so  many  Boolean 
functions  on  n  variables  and  so  few  circuits  containing  C  or  fewer  gates  that  unless  C  is  large, 
not  all  Boolean  functions  can  be  realized  with  C  gates  or  fewer. 

Circuit  complexity  is  also  explored  in  Chapter  9.  The  present  chapter  develops  methods 
to  derive  lower  bounds  on  the  size  and  depth  of  circuits.  A  lower  bound  on  the  circuit  size 
(depth) of a function f is a value for the size (depth) below which there does not exist a circuit
for f. Thus, every circuit for f must have a size (depth) greater than or equal to the lower
bound.  In  Chapter  9  we  also  establish  a  connection  between  circuit  depth  and  formula  size, 
the  number  of  Boolean  operations  needed  to  realize  a  Boolean  function  by  a  formula.  This 
allows  us  to  derive  an  upper  bound  on  formula  size  from  an  upper  bound  on  depth.  Thus,  the 
depth  bounds  of  this  chapter  are  useful  in  deriving  upper  bounds  on  the  size  of  the  smallest 
formulas  for  problems.  Prefix  circuits  are  used  in  the  present  chapter  to  design  fast  adders. 
They  are  also  used  in  Chapter  6  to  design  fast  parallel  algorithms. 


2.2  Straight-Line  Programs  and  Circuits 

As  suggested  in  Section  1.4.1,  the  mapping  between  inputs  and  outputs  of  a  logic  circuit  can 
be  described  by  a  binary  function.  In  this  section  we  formalize  this  idea  and,  in  addition, 
demonstrate  that  every  binary  function  can  be  realized  by  a  circuit.  Normal-form  expansions 
of  Boolean  functions  play  a  central  role  in  establishing  the  latter  result.  Circuits  were  defined 
informally  in  Section  1.4.1.  We  now  give  a  formal  definition  of  circuits. 

Figure 2.1 A circuit is the graph of a Boolean straight-line program.

To fix ideas, we start with an example. Figure 2.1 shows a circuit that contains two AND
gates, one OR gate, and two NOT gates. (Circles denote NOT gates, AND and OR gates are
labeled $\wedge$ and $\vee$, respectively.) Corresponding to this circuit is the following functional
description of the circuit, where $g_j$ is the value computed by the jth input or gate of the circuit:

$g_1 := x;$                  $g_5 := g_1 \wedge g_4;$
$g_2 := y;$                  $g_6 := g_3 \wedge g_2;$
$g_3 := \overline{g_1};$     $g_7 := g_5 \vee g_6;$
$g_4 := \overline{g_2};$

The statement $g_1 := x;$ means that the external input x is the value associated with the first
vertex of the circuit. The statement $g_3 := \overline{g_1};$ means that the value computed at the third
vertex is the NOT of the value computed at the first vertex. The statement $g_5 := g_1 \wedge g_4;$
means that the value computed at the fifth vertex is the AND of the values computed at the
first and fourth vertices. The statement $g_7 := g_5 \vee g_6;$ means that the value computed at the
seventh vertex is the OR of the values computed at the fifth and sixth vertices. The above is
a description of the functions computed by the circuit. It does not explicitly specify which
function(s) are the outputs of the circuit.

Shown below is an alternative description of the above circuit that contains the same
information. It is a straight-line program whose syntax is closer to that of standard programming
languages.  Each  step  is  numbered  and  its  associated  purpose  is  given.  Input  and  output 
steps  are  identified  by  the  keywords  READ  and  OUTPUT,  respectively.  Computation  steps 
are  identified  by  the  keywords  AND,  OR,  and  NOT. 


(1 READ x)        (6 AND 3 2)
(2 READ y)        (7 OR 5 6)
(3 NOT 1)         (8 OUTPUT 5)        (2.2)
(4 NOT 2)         (9 OUTPUT 7)
(5 AND 1 4)

The  correspondence  between  the  steps  of  a  straight-line  program  and  the  functions  computed 
at  them  is  evident. 

Straight-line programs are not limited to describing logic circuits. They can also be used to
describe algebraic computations. (See Chapter 6.) In this case, a computation step is identified
with a keyword describing the particular algebraic operation to be performed. In the case of
logic circuits, the operations can include many functions other than the basic three mentioned
above.

As  illustrated  above,  a  straight-line  program  can  be  constructed  for  any  circuit.  Similarly, 
given  a  straight-line  program,  a  circuit  can  be  drawn  for  it  as  well.  We  now  formally  define 
straight-line  programs,  circuits,  and  characteristics  of  the  two. 

DEFINITION 2.2.1 A straight-line program is a set of steps each of which is an input step,
denoted (s READ x), an output step, denoted (s OUTPUT i), or a computation step, denoted
(s OP i ... k). Here s is the number of a step, x denotes an input variable, and the keywords
READ, OUTPUT, and OP identify steps in which an input is read, an output produced, and the
operation OP is performed. In the sth computation step the arguments to OP are the results produced
at steps i, ..., k. It is required that these steps precede the sth step; that is, s > i, ..., k.

A circuit is the graph of a straight-line program. (The requirement that each computation
step operate on results produced in preceding steps ensures that this graph is a DAG.) The fan-in
of the circuit is the maximum in-degree of any vertex. The fan-out of the circuit is the maximum
out-degree of any vertex. A gate is the vertex associated with a computation step.

The basis $\Omega$ of a circuit and its corresponding straight-line program is the set of operations
that they use. The bases of Boolean straight-line programs and logic circuits contain only Boolean
functions. The standard basis, $\Omega_0$, for a logic circuit is the set {AND, OR, NOT}.

2.2.1 Functions Computed by Circuits

As  stated  above,  each  step  of  a  straight-line  program  computes  a  function.  We  now  define  the 
functions  computed  by  straight-line  programs,  using  the  example  given  in  Eq.  (2.2). 

DEFINITION 2.2.2 Let $g_s$ be the function computed by the sth step of a straight-line
program. If the sth step is the input step (s READ x), then $g_s = x$. If it is the computation
step (s OP i ... k), the function is $g_s = \mathrm{OP}(g_i, \ldots, g_k)$, where $g_i, \ldots, g_k$ are the functions
computed at steps on which the sth step depends. If a straight-line program has n inputs and m
outputs, it computes a function $f : B^n \mapsto B^m$. If $s_1, s_2, \ldots, s_m$ are the output steps, then
$f = (g_{s_1}, g_{s_2}, \ldots, g_{s_m})$. The function computed by a circuit is the function computed by the
corresponding straight-line program.

The functions computed by the logic circuit of Fig. 2.1 are given below. The expression
for $g_s$ is found by substituting for its arguments the expressions derived at the steps on which
it depends.

$g_1 := x;$                $g_5 := x \wedge \overline{y};$
$g_2 := y;$                $g_6 := \overline{x} \wedge y;$
$g_3 := \overline{x};$     $g_7 := (x \wedge \overline{y}) \vee (\overline{x} \wedge y);$
$g_4 := \overline{y};$

The function computed by the above Boolean straight-line program is $f(x, y) = (g_5, g_7)$.
The table of values assumed by f as the inputs x and y run through all possible values is shown
below. The value of $g_7$ is the EXCLUSIVE OR function.

x    y    g_5    g_7
0    0     0      0
0    1     0      1
1    0     1      1
1    1     0      0

We  now  ask  the  following  question:  “Given  a  circuit  with  values  for  its  inputs,  how  can  the 
values of its outputs be computed?” One response is to build a circuit of physical gates, supply
values  for  the  inputs,  and  then  wait  for  the  signals  to  propagate  through  the  gates  and  arrive  at 
the  outputs.  A  second  response  is  to  write  a  program  in  a  high-level  programming  language  to 
compute  the  values  of  the  outputs.  A  simple  program  for  this  purpose  assigns  each  step  to  an 
entry  of  an  array  and  then  evaluates  the  steps  in  order.  This  program  solves  the  circuit  value 
problem;  that  is,  it  determines  the  value  of  a  circuit. 
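
A minimal sketch of such a program is given below, applied to the straight-line program of Eq. (2.2). The tuple encoding of steps, the names `PROGRAM`, `OPS`, and `evaluate`, and the use of a dictionary indexed by step number in place of an array are all illustrative assumptions rather than notation from the text.

```python
# Straight-line program of Eq. (2.2); each step is (number, keyword, args...).
PROGRAM = [
    (1, "READ", "x"), (2, "READ", "y"),
    (3, "NOT", 1), (4, "NOT", 2),
    (5, "AND", 1, 4), (6, "AND", 3, 2),
    (7, "OR", 5, 6),
    (8, "OUTPUT", 5), (9, "OUTPUT", 7),
]

# Operations of the standard basis on the values 0 and 1.
OPS = {
    "NOT": lambda u: 1 - u,
    "AND": lambda u, v: u & v,
    "OR": lambda u, v: u | v,
}


def evaluate(program, inputs):
    """Evaluate the steps in order and return the list of output values."""
    value = {}        # value[s] holds the result computed at step s
    outputs = []
    for s, keyword, *args in program:
        if keyword == "READ":
            value[s] = inputs[args[0]]
        elif keyword == "OUTPUT":
            outputs.append(value[args[0]])
        else:
            value[s] = OPS[keyword](*(value[i] for i in args))
    return outputs


for x in (0, 1):
    for y in (0, 1):
        print(x, y, evaluate(PROGRAM, {"x": x, "y": y}))
```

Because every step refers only to earlier steps, a single pass in step order suffices, and the running time is proportional to the number of steps.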

2.2.2  Circuits  That  Compute  Functions 

Now  that  we  know  how  to  compute  the  function  defined  by  a  circuit  and  its  corresponding 
straight-line  program,  we  ask:  given  a  function,  how  can  we  construct  a  circuit  (and  straight- 
line  program)  that  will  compute  it?  Since  we  presume  that  computational  tasks  are  defined  by 
functions,  it  is  important  to  know  how  to  build  simple  machines,  circuits,  that  will  solve  these 
tasks.  In  Chapter  3  we  show  that  circuits  play  a  central  role  in  the  design  of  machines  with 
memory.  Thus,  whether  a  function  or  task  is  to  be  solved  with  a  machine  without  memory  (a 
circuit)  or  a  machine  with  memory  (such  as  the  random-access  machine),  the  circuit  and  its 
associated  straight-line  program  play  a  key  role. 

To construct a circuit for a function, we begin by describing the function in a table. As
seen earlier, the table for a function $f : B^n \mapsto B^m$ has n columns containing all $2^n$
possible values for the n input variables of the function. Thus, it has $2^n$ rows. It also has
m columns containing the m outputs associated with each pattern of n inputs. If we let
$x_1, x_2, \ldots, x_n$ be the input variables of f and let $y_1, y_2, \ldots, y_m$ be its output variables,
then we write $f(x_1, x_2, \ldots, x_n) = (y_1, y_2, \ldots, y_m)$. This is illustrated by the function
$f^{(3,2)}_{\text{example}}(x_1, x_2, x_3) = (y_1, y_2)$ defined in Fig. 2.2 on page 39.

x_1  x_2  x_3    y_1  y_2
 0    0    0      1    1
 0    0    1      0    1
 0    1    0      1    0
 0    1    1      0    1
 1    0    0      1    1
 1    0    1      1    0
 1    1    0      0    1
 1    1    1      1    1

Figure 2.2 The truth table for the function $f^{(3,2)}_{\text{example}}$.

A  binary  function  is  one  whose  domain  and  codomain  are  Cartesian  products  of  B  = 
{0,  1}.  A  Boolean  function  is  a  binary  function  whose  codomain  consists  of  the  set  B.  In 
other  words,  it  has  one  output. 

As  we  see  in  Section  2.3,  normal  forms  provide  standard  ways  to  construct  circuits  for 
Boolean  functions.  Because  a  normal-form  expansion  of  a  function  generally  does  not  yield 
a  circuit  of  smallest  size  or  depth,  methods  are  needed  to  simplify  the  algebraic  expressions 
produced  by  these  normal  forms.  This  topic  is  discussed  in  Section  2.2.4. 

Before  exploring  the  algebraic  properties  of  simple  Boolean  functions,  we  define  the  basic 
circuit  complexity  measures  used  in  this  book. 


2.2.3  Circuit  Complexity  Measures 

We  often  ask  for  the  smallest  or  most  shallow  circuit  for  a  function.  If  we  need  to  compute 
a  function  with  a  circuit,  as  is  done  in  central  processing  units,  then  knowing  the  size  of  the 
smallest  circuit  is  important.  Also  important  is  the  depth  of  the  circuit.  It  takes  time  for 
signals  applied  to  the  circuit  inputs  to  propagate  to  the  outputs,  and  the  length  of  the  longest 
path  through  the  circuit  determines  this  time.  When  central  processing  units  must  be  fast, 
minimizing  circuit  depth  becomes  important. 

As  indicated  in  Section  1.5,  the  size  of  a  circuit  also  provides  a  lower  bound  on  the  space- 
time  product  needed  to  solve  a  problem  on  the  random-access  machine,  a  model  for  modern 
computers.  Consequently,  if  the  size  of  the  smallest  circuit  for  a  function  is  large,  its  space-time 
product  must  be  large.  Thus,  a  problem  can  be  shown  to  be  hard  to  compute  by  a  machine 
with  memory  if  it  can  be  shown  that  every  circuit  for  it  is  large. 

We  now  define  two  important  circuit  complexity  measures. 

DEFINITION 2.2.3 The size of a logic circuit is the number of gates it contains. Its depth is the
number of gates on the longest path through the circuit. The circuit size, $C_\Omega(f)$, and circuit
depth, $D_\Omega(f)$, of a Boolean function $f : B^n \mapsto B^m$ are defined as the smallest size and smallest
depth of any circuit, respectively, over the basis $\Omega$ for f.
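
The size and depth of one particular circuit can be read off its straight-line program in a single pass, as the following sketch suggests. It measures only the given circuit, here the one of Eq. (2.2); it does not find the minimum over all circuits for a function, which is what circuit size and circuit depth refer to. The tuple encoding and the name `size_and_depth` are illustrative assumptions.

```python
# The straight-line program of Eq. (2.2): (step, keyword, arguments...).
PROGRAM = [
    (1, "READ", "x"), (2, "READ", "y"),
    (3, "NOT", 1), (4, "NOT", 2),
    (5, "AND", 1, 4), (6, "AND", 3, 2),
    (7, "OR", 5, 6),
    (8, "OUTPUT", 5), (9, "OUTPUT", 7),
]


def size_and_depth(program):
    """Return (number of gates, number of gates on a longest path)."""
    depth = {}        # depth[s] = gates on a longest path ending at step s
    size = 0
    for s, keyword, *args in program:
        if keyword == "READ":
            depth[s] = 0                      # inputs are not gates
        elif keyword == "OUTPUT":
            continue                          # output steps add no gate
        else:
            size += 1
            depth[s] = 1 + max(depth[i] for i in args)
    return size, max(depth.values())


print(size_and_depth(PROGRAM))   # (5, 3)
```

The five gates are the two NOT gates, two AND gates, and one OR gate of Fig. 2.1; a longest path passes through a NOT gate, an AND gate, and the OR gate.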

Most  Boolean  functions  on  n  variables  are  very  complex.  As  shown  in  Sections  2.12  and 
2.13, their circuit size is proportional to $2^n/n$ and their depth is approximately n. Fortunately,
most  functions  of  interest  have  much  smaller  size  and  depth.  (It  should  be  noted  that  the  circuit 
of  smallest  size  for  a  function  may  be  different  from  that  of  smallest  depth.) 


2.2.4  Algebraic  Properties  of  Boolean  Functions 

Since the operations AND ($\wedge$), OR ($\vee$), EXCLUSIVE OR ($\oplus$), and NOT ($\neg$ or an overbar, as in $\overline{x}$) play a vital
role  in  the  construction  of  normal  forms,  we  simplify  the  subsequent  discussion  by  describing 
their  properties. 

If  we  interchange  the  two  arguments  of  AND,  OR,  or  EXCLUSIVE  OR,  it  follows  from  their 
definition  that  their  values  do  not  change.  This  property,  called  commutativity,  holds  for  all 
three operators, as stated next.

COMMUTATIVITY
$x_1 \vee x_2 = x_2 \vee x_1$
$x_1 \wedge x_2 = x_2 \wedge x_1$
$x_1 \oplus x_2 = x_2 \oplus x_1$


When  constants  are  substituted  for  one  of  the  variables  of  these  three  operators,  the  expression 
computed  is  simplified,  as  shown  below. 

SUBSTITUTION OF CONSTANTS
$x_1 \vee 0 = x_1$        $x_1 \wedge 1 = x_1$
$x_1 \vee 1 = 1$          $x_1 \oplus 0 = x_1$
$x_1 \wedge 0 = 0$        $x_1 \oplus 1 = \overline{x_1}$

Also, when one of the variables of one of these functions is replaced by itself or its negation,
the functions simplify, as shown below.

ABSORPTION RULES
$x_1 \vee x_1 = x_1$                 $x_1 \wedge x_1 = x_1$
$x_1 \vee \overline{x_1} = 1$        $x_1 \wedge \overline{x_1} = 0$
$x_1 \oplus x_1 = 0$                 $x_1 \vee (x_1 \wedge x_2) = x_1$
$x_1 \oplus \overline{x_1} = 1$      $x_1 \wedge (x_1 \vee x_2) = x_1$


To  prove  each  of  these  results,  it  suffices  to  test  exhaustively  each  of  the  values  of  the  arguments 
of  these  functions  and  show  that  the  right-  and  left-hand  sides  have  the  same  value. 
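
The exhaustive test is easy to mechanize. The sketch below checks a selection of the substitution and absorption rules on every assignment of 0 and 1 to $x_1$ and $x_2$; the dictionary `RULES` and the helper `holds` are hypothetical names introduced for illustration, and Python's `|`, `&`, and `^` stand in for OR, AND, and EXCLUSIVE OR.

```python
from itertools import product


def NOT(u):
    return 1 - u


# Each rule is a pair (left side, right side) of functions of (x1, x2).
RULES = {
    "x1 OR 0 = x1": (lambda x1, x2: x1 | 0, lambda x1, x2: x1),
    "x1 AND 0 = 0": (lambda x1, x2: x1 & 0, lambda x1, x2: 0),
    "x1 XOR 1 = NOT x1": (lambda x1, x2: x1 ^ 1, lambda x1, x2: NOT(x1)),
    "x1 OR NOT x1 = 1": (lambda x1, x2: x1 | NOT(x1), lambda x1, x2: 1),
    "x1 AND NOT x1 = 0": (lambda x1, x2: x1 & NOT(x1), lambda x1, x2: 0),
    "x1 XOR x1 = 0": (lambda x1, x2: x1 ^ x1, lambda x1, x2: 0),
    "x1 OR (x1 AND x2) = x1": (lambda x1, x2: x1 | (x1 & x2), lambda x1, x2: x1),
    "x1 AND (x1 OR x2) = x1": (lambda x1, x2: x1 & (x1 | x2), lambda x1, x2: x1),
}


def holds(lhs, rhs):
    """True if both sides agree on all four assignments to (x1, x2)."""
    return all(lhs(x1, x2) == rhs(x1, x2)
               for x1, x2 in product((0, 1), repeat=2))


for name, (lhs, rhs) in RULES.items():
    print(name, holds(lhs, rhs))
```

The same helper rejects a false identity such as $x_1 \vee x_2 = x_1$, since the two sides differ when $x_1 = 0$ and $x_2 = 1$.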

DeMorgan’s  rules,  shown  below,  are  very  important  in  proving  properties  about  circuits 
because  they  allow  each  AND  gate  to  be  replaced  by  an  OR  gate  and  three  NOT  gates  and  vice 
versa.  The  rules  can  be  shown  correct  by  constructing  tables  for  each  of  the  given  functions. 


DEMORGAN’S RULES
$\overline{(x_1 \vee x_2)} = \overline{x_1} \wedge \overline{x_2}$
$\overline{(x_1 \wedge x_2)} = \overline{x_1} \vee \overline{x_2}$
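
To see where the count of one OR gate and three NOT gates comes from, apply the first rule to the negated variables. This short derivation also uses the fact that a double negation cancels, $\overline{\overline{x}} = x$, which is assumed here and not listed among the rules above.

```latex
\overline{(\overline{x_1} \vee \overline{x_2})}
  = \overline{\overline{x_1}} \wedge \overline{\overline{x_2}}
  = x_1 \wedge x_2
```

Read from right to left, an AND gate is replaced by two NOT gates on its inputs, one OR gate, and a third NOT gate on its output. The second rule gives the corresponding replacement of an OR gate in the same way.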

The  functions  AND,  OR,  and  EXCLUSIVE  OR  are  all  associative;  that  is,  all  ways  of  combining 
three or more variables with any of these functions give the same result. (An operator $\odot$ is
associative if for all values of a, b, and c, $a \odot (b \odot c) = (a \odot b) \odot c$.) Again, proof by
enumeration  suffices  to  establish  the  following  results. 

ASSOCIATIVITY
$x_1 \vee (x_2 \vee x_3) = (x_1 \vee x_2) \vee x_3$
$x_1 \wedge (x_2 \wedge x_3) = (x_1 \wedge x_2) \wedge x_3$
$x_1 \oplus (x_2 \oplus x_3) = (x_1 \oplus x_2) \oplus x_3$

Because of associativity it is not necessary to parenthesize repeated uses of the operators $\vee$, $\wedge$,
and $\oplus$.

Finally,  the  following  distributive  laws  are  important  in  simplifying  Boolean  algebraic 
expressions.  The  first  two  laws  are  the  same  as  the  distributivity  of  integer  multiplication  over 
integer addition when multiplication and addition are replaced by AND and OR.

DISTRIBUTIVITY
$x_1 \wedge (x_2 \vee x_3) = (x_1 \wedge x_2) \vee (x_1 \wedge x_3)$
$x_1 \wedge (x_2 \oplus x_3) = (x_1 \wedge x_2) \oplus (x_1 \wedge x_3)$
$x_1 \vee (x_2 \wedge x_3) = (x_1 \vee x_2) \wedge (x_1 \vee x_3)$

We often write $x \wedge y$ as $xy$. The operator $\wedge$ has precedence over the operators $\vee$ and $\oplus$, which
means that parentheses in $(x \wedge y) \vee z$ and $(x \wedge y) \oplus z$ may be dropped.

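The above rules are illustrated by the following formula:

$\overline{(\overline{x} \wedge (y \oplus z))} \wedge (x \vee y) = (x \vee \overline{(y \oplus z)}) \wedge (x \vee y)$
$= (x \vee (\overline{y} \oplus z)) \wedge (x \vee y)$
$= x \vee (y \wedge (\overline{y} \oplus z))$
$= x \vee ((y \wedge \overline{y}) \oplus (y \wedge z))$
$= x \vee (0 \oplus y \wedge z)$
$= x \vee (y \wedge z)$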

DeMorgan’s second rule is used to simplify the first term in the first equation. The last
rule on substitution of constants is used twice to simplify the second equation. The third
distributivity rule and commutativity of $\wedge$ are used to simplify the third one. The second
distributivity rule is used to expand the fourth equation. The fifth equation is simplified by
invoking the third absorption rule. The final equation results from the commutativity of $\oplus$
and application of the rule $x_1 \oplus 0 = x_1$. When there is no loss of clarity, we drop the operator
symbol $\wedge$ between two literals.

2.3  Normal-Form  Expansions  of  Boolean  Functions 

Normal  forms  are  standard  ways  of  constructing  circuits  from  the  tables  defining  Boolean 
functions.  They  are  easy  to  apply,  although  the  circuits  they  produce  are  generally  far  from 
optimal.  They  demonstrate  that  every  Boolean  function  can  be  realized  over  the  standard  basis 
as  well  as  the  basis  containing  AND  and  EXCLUSIVE  OR. 

In  this  section  we  define  five  normal  forms:  the  disjunctive  and  conjunctive  normal  forms, 
the  sum-of-products  expansion,  the  product-of-sums  expansion,  and  the  ring-sum  expansion. 

2.3.1  Disjunctive  Normal  Form 

A minterm in the variables $x_1, x_2, \ldots, x_n$ is the AND of each variable or its negation. For
example, when $n = 3$, $\bar{x}_1 \wedge \bar{x}_2 \wedge \bar{x}_3$ is a minterm. It has value 1 exactly when each variable
has value 0. $x_1 \wedge \bar{x}_2 \wedge x_3$ is another minterm; it has value 1 exactly when $x_1 = 1$, $x_2 = 0$ and
$x_3 = 1$. It follows that a minterm on $n$ variables has value 1 for exactly one of the $2^n$ points
in its domain. Using the notation $x^1 = x$ and $x^0 = \bar{x}$, we see that the above minterms can
be written as $x_1^0 x_2^0 x_3^0$ and $x_1^1 x_2^0 x_3^1$, respectively, when we drop the use of the AND operator $\wedge$.
Thus, $x_1^0 x_2^0 x_3^0 = 1$ when $x = (x_1, x_2, x_3) = (0, 0, 0)$ and $x_1^1 x_2^0 x_3^1 = 1$ when $x = (1, 0, 1)$.
That is, the minterm $x^{(c)} = x_1^{c_1} \wedge x_2^{c_2} \wedge \cdots \wedge x_n^{c_n}$ has value 1 exactly when $x = c$ where $c =
(c_1, c_2, \ldots, c_n)$. A minterm of a Boolean function $f$ is a minterm $x^{(c)}$ that contains all the
variables of $f$ and for which $f(c) = 1$.

The word “disjunction” is a synonym for OR, and the disjunctive normal form (DNF) of
a Boolean function $f : \mathcal{B}^n \mapsto \mathcal{B}$ is the OR of the minterms of $f$. Thus, $f$ has value 1 when
exactly one of its minterms has value 1 and has value 0 otherwise. Consider the function whose
table is given in Fig. 2.3(a). Its disjunctive normal form (or minterm expansion) is given by
the following formula:

$$f(x_1, x_2, x_3) = x_1^0 x_2^0 x_3^0 \vee x_1^0 x_2^1 x_3^0 \vee x_1^1 x_2^0 x_3^0 \vee x_1^1 x_2^0 x_3^1 \vee x_1^1 x_2^1 x_3^1$$

Figure 2.3 Truth tables illustrating the disjunctive and conjunctive normal forms.

The parity function $f_\oplus^{(n)} : \mathcal{B}^n \mapsto \mathcal{B}$ on $n$ inputs has value 1 when an odd number of
inputs is 1 and value 0 otherwise. It can be realized by a circuit containing $n - 1$ instances of
the EXCLUSIVE OR operator; that is, $f_\oplus^{(n)}(x_1, \ldots, x_n) = x_1 \oplus x_2 \oplus \cdots \oplus x_n$. However, the
DNF of $f_\oplus^{(n)}$ contains $2^{n-1}$ minterms, a number exponential in $n$. The DNF of $f_\oplus^{(3)}$ is

$$f_\oplus^{(3)}(x, y, z) = \bar{x}\,\bar{y}\,z \vee \bar{x}\,y\,\bar{z} \vee x\,\bar{y}\,\bar{z} \vee x\,y\,z$$

Here  we  use  the  standard  notation  for  a  variable  and  its  complement. 
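The DNF can be read directly off a truth table. The sketch below is illustrative only: it assumes a Boolean function is given as a Python callable on 0/1 arguments and represents each minterm $x^{(c)}$ by its point $c$; the names `dnf_minterms`, `minterm_value`, and `eval_dnf` are not from the text.

```python
from itertools import product

def dnf_minterms(f, n):
    """Return the points c in {0,1}^n with f(c) = 1; each one is a minterm of f."""
    return [c for c in product((0, 1), repeat=n) if f(*c) == 1]

def minterm_value(c, x):
    """Value of the minterm x^(c): 1 exactly when x == c."""
    return int(all(xi == ci for xi, ci in zip(x, c)))

def eval_dnf(minterms, x):
    """OR of the minterm values at the point x."""
    return int(any(minterm_value(c, x) for c in minterms))

def parity(*bits):
    """1 when an odd number of inputs is 1."""
    return sum(bits) % 2

parity3 = dnf_minterms(parity, 3)
sizes = [len(dnf_minterms(parity, n)) for n in range(1, 7)]
```

Here `parity3` lists the four points at which three-input parity is 1, and `sizes` doubles with each added variable, matching the $2^{n-1}$ count.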

2.3.2  Conjunctive  Normal  Form 

A maxterm in the variables $x_1, x_2, \ldots, x_n$ is the OR of each variable or its negation. For
example, $x_1 \vee x_2 \vee \bar{x}_3$ is a maxterm. It has value 0 exactly when $x_1 = x_2 = 0$ and $x_3 = 1$.
$x_1 \vee \bar{x}_2 \vee \bar{x}_3$ is another maxterm; it has value 0 exactly when $x_1 = 0$ and $x_2 = x_3 = 1$.
It follows that a maxterm on $n$ variables has value 0 for exactly one of the $2^n$ points in its
domain. We see that the above maxterms can be written as $x_1^1 \vee x_2^1 \vee x_3^0$ and $x_1^1 \vee x_2^0 \vee x_3^0$,
respectively. Thus, $x_1^1 \vee x_2^1 \vee x_3^0 = 0$ when $x = (x_1, x_2, x_3) = (0, 0, 1)$ and $x_1^1 \vee x_2^0 \vee x_3^1 = 0$
when $x = (0, 1, 0)$. That is, the maxterm $x^{(c)} = x_1^{\bar{c}_1} \vee x_2^{\bar{c}_2} \vee \cdots \vee x_n^{\bar{c}_n}$ has value 0 exactly
when $x = c$. A maxterm of a Boolean function $f$ is a maxterm $x^{(c)}$ that contains all the
variables of $f$ and for which $f(c) = 0$.

The word “conjunction” is a synonym for AND, and the conjunctive normal form (CNF)
of a Boolean function $f : \mathcal{B}^n \mapsto \mathcal{B}$ is the AND of the maxterms of $f$. Thus, $f$ has value 0
when exactly one of its maxterms has value 0 and has value 1 otherwise. Consider the function
whose table is given in Fig. 2.3(b). Its conjunctive normal form (or maxterm expansion) is
given by the following formula:

$$f(x_1, x_2, x_3) = (x_1^1 \vee x_2^1 \vee x_3^0) \wedge (x_1^1 \vee x_2^0 \vee x_3^0) \wedge (x_1^0 \vee x_2^0 \vee x_3^1)$$

An important relationship holds between the DNF and CNF representations for Boolean
functions. If DNF($f$) and CNF($f$) are the representations of $f$ in the DNF and CNF expansions,
then the following identity holds (see Problem 2.6):

$$\mathrm{CNF}(f) = \overline{\mathrm{DNF}(\bar{f})}$$

It follows that the CNF of the parity function has $2^{n-1}$ maxterms.
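The relationship between the two expansions, namely that the CNF of $f$ is the complement of the DNF of the complement of $f$, can be checked numerically. This is a small sketch under the assumption that functions are Python callables on 0/1 arguments; all names in it are hypothetical.

```python
from itertools import product

def minterms(f, n):
    """Points where f is 1."""
    return [c for c in product((0, 1), repeat=n) if f(*c) == 1]

def maxterms(f, n):
    """Points where f is 0; each names one maxterm of f."""
    return [c for c in product((0, 1), repeat=n) if f(*c) == 0]

def eval_cnf(f, n, x):
    """AND over the maxterms of f. The maxterm for c is 0 only at x == c,
    so it evaluates to 1 whenever some coordinate of x differs from c."""
    return int(all(any(xi != ci for xi, ci in zip(x, c)) for c in maxterms(f, n)))

def eval_not_dnf_of_complement(f, n, x):
    """Complement of the DNF of the complement of f, evaluated at x."""
    complement = lambda *bits: 1 - f(*bits)
    dnf_value = int(any(tuple(x) == c for c in minterms(complement, n)))
    return 1 - dnf_value

def majority(a, b, c):
    return int(a + b + c >= 2)

def parity(*bits):
    return sum(bits) % 2

agree = all(
    eval_cnf(majority, 3, x) == eval_not_dnf_of_complement(majority, 3, x) == majority(*x)
    for x in product((0, 1), repeat=3)
)
parity_maxterms = len(maxterms(parity, 4))
```

The maxterms of $f$ are exactly the minterms of its complement, which is why the parity function on four inputs has $2^3 = 8$ of each.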

Since each function $f : \mathcal{B}^n \mapsto \mathcal{B}^m$ can be expanded to its CNF or DNF and each can be
realized with circuits, the following result is immediate.

THEOREM 2.3.1 Every function $f : \mathcal{B}^n \mapsto \mathcal{B}^m$ can be realized by a logic circuit.

2.3.3  SOPE  and  POSE  Normal  Forms 

The  sum-of-products  and  product-of-sums  normal  forms  are  simplifications  of  the  disjunctive 
and  conjunctive  normal  forms,  respectively.  These  simplifications  are  obtained  by  using  the 
rules  stated  in  Section  2.2.4. 

A product in the variables $x_{i_1}, x_{i_2}, \ldots, x_{i_k}$ is the AND of each of these variables or their
negations. For example, $x_2 x_5$ is a product. A minterm is a product that contains each of
the variables of a function. A product of a Boolean function $f$ is a product in some of the
variables of $f$. A sum-of-products expansion (SOPE) of a Boolean function is the OR (the
sum) of products of $f$. Thus, the DNF is a special case of the SOPE of a function.

A  SOPE  of  a  Boolean  function  can  be  obtained  by  simplifying  the  DNF  of  a  function 
using  the  rules  given  in  Section  2.2.4.  For  example,  the  DNF  given  earlier  and  shown  below 
can  be  simplified  to  produce  a  SOPE. 

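$$f(x_1, x_2, x_3) = \bar{x}_1 \bar{x}_2 \bar{x}_3 \vee \bar{x}_1 x_2 \bar{x}_3 \vee x_1 \bar{x}_2 \bar{x}_3 \vee x_1 \bar{x}_2 x_3 \vee x_1 x_2 x_3$$

It is easy to see that the first and second terms combine to give $\bar{x}_1 \bar{x}_3$, the first and third give
$\bar{x}_2 \bar{x}_3$ (we use the property that $g \vee g = g$), and the last two give $x_1 x_3$. That is, we can write
the following SOPE for $f$:

$$f = \bar{x}_1 \bar{x}_3 \vee x_1 x_3 \vee \bar{x}_2 \bar{x}_3 \qquad (2.3)$$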

Clearly,  we  could  have  stopped  before  any  one  of  the  above  simplifications  was  used  and  gen¬ 
erated  another  SOPE  for  /.  This  illustrates  the  point  that  a  Boolean  function  may  have  many 
SOPEs  but  only  one  DNF. 

A sum in the variables $x_{j_1}, x_{j_2}, \ldots, x_{j_k}$ is the OR of each of these variables or their negations.
For example, $x_3 \vee x_4 \vee x_7$ is a sum. A maxterm is a sum that contains each of the
variables of a function. A sum of a Boolean function $f$ is a sum in some of the variables of
$f$. A product-of-sums expansion (POSE) of a Boolean function is the AND (the product) of
sums of $f$. Thus, the CNF is a special case of the POSE of a function.

A POSE of a Boolean function can be obtained by simplifying the CNF of a function
using the rules given in Section 2.2.4. For example, the conjunction of the two maxterms
$x_1 \vee \bar{x}_2 \vee x_3$ and $x_1 \vee \bar{x}_2 \vee \bar{x}_3$, namely $(x_1 \vee \bar{x}_2 \vee x_3) \wedge (x_1 \vee \bar{x}_2 \vee \bar{x}_3)$, can be reduced to
$x_1 \vee \bar{x}_2$ by the application of rules of Section 2.2.4, as shown below:

$(x_1 \vee \bar{x}_2 \vee x_3) \wedge (x_1 \vee \bar{x}_2 \vee \bar{x}_3)$
$= x_1 \vee (\bar{x}_2 \vee x_3) \wedge (\bar{x}_2 \vee \bar{x}_3)$ {3rd distributivity rule}
$= x_1 \vee \bar{x}_2 \vee (x_3 \wedge \bar{x}_3)$ {3rd distributivity rule}
$= x_1 \vee \bar{x}_2 \vee 0$ {6th absorption rule}
$= x_1 \vee \bar{x}_2$ {1st rule on substitution of constants}

It is easily shown that the POSE of the parity function is its CNF. (See Problem 2.8.)

2.3.4  Ring-Sum  Expansion 

The ring-sum expansion (RSE) of a function $f$ is the EXCLUSIVE OR ($\oplus$) of a constant
and products ($\wedge$) of unnegated variables of $f$. For example, $1 \oplus x_1 x_3 \oplus x_2 x_4$ is an RSE.
The operations $\oplus$ and $\wedge$ over the set $\mathcal{B} = \{0, 1\}$ constitute a ring. (Rings are examined in
Section 6.2.1.) Any two instances of the same product in the RSE can be eliminated since they
sum to 0.

The RSE of a Boolean function $f : \mathcal{B}^n \mapsto \mathcal{B}$ can be constructed from its DNF, as we
show. Since a minterm of $f$ has value 1 on exactly one of the $2^n$ points in its domain, at
most one minterm in the DNF for $f$ has value 1 for any point in its domain. Thus, we
can combine minterms with EXCLUSIVE OR instead of OR without changing the value of the
function. Now replace $\bar{x}_i$ with $x_i \oplus 1$ in each minterm containing $\bar{x}_i$ and then apply the
second distributivity rule. We simplify the resulting formula by using commutativity and the
absorption rule $x_i \oplus x_i = 0$. For example, since the minterms of $(\bar{x}_1 \vee x_2) x_3$ are $\bar{x}_1 x_2 x_3$,
$\bar{x}_1 \bar{x}_2 x_3$, and $x_1 x_2 x_3$, we construct the RSE of this function as follows:

$(\bar{x}_1 \vee x_2) x_3 = \bar{x}_1 x_2 x_3 \oplus \bar{x}_1 \bar{x}_2 x_3 \oplus x_1 x_2 x_3$
$= (x_1 \oplus 1) x_2 x_3 \oplus (x_1 \oplus 1)(x_2 \oplus 1) x_3 \oplus x_1 x_2 x_3$
$= x_2 x_3 \oplus x_1 x_2 x_3 \oplus x_3 \oplus x_1 x_3 \oplus x_2 x_3 \oplus x_1 x_2 x_3 \oplus x_1 x_2 x_3$
$= x_3 \oplus x_1 x_3 \oplus x_1 x_2 x_3$

The third equation follows by applying the second distributivity rule and commutativity. The
fourth follows by applying $x_i \oplus x_i = 0$ and commutativity. The two occurrences of $x_2 x_3$ are
canceled, as are two of the three instances of $x_1 x_2 x_3$.

As this example illustrates, the RSE of a function $f : \mathcal{B}^n \mapsto \mathcal{B}$ is the EXCLUSIVE OR of
a Boolean constant $c_0$ and one or more products of unnegated variables of $f$. Since each of
the $n$ variables of $f$ can be present or absent from a product, there are $2^n$ products, including
the product that contains no variables; that is, a constant whose value is 0 or 1. For example,
$1 \oplus x_3 \oplus x_1 x_3 \oplus x_1 x_2 x_3$ is the RSE of the function $\overline{(\bar{x}_1 \vee x_2) x_3}$.
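The DNF-to-RSE construction can be carried out mechanically. The sketch below is one possible rendering, not a definitive implementation: it assumes a monomial is represented as a frozenset of 1-based variable indices (the empty set standing for the constant 1), and it uses set symmetric difference to apply the rule that two copies of the same product cancel. The names `rse` and `eval_rse` are illustrative.

```python
from itertools import product, combinations

def rse(f, n):
    """Return the set of monomials in the ring-sum expansion of f."""
    terms = set()
    for c in product((0, 1), repeat=n):
        if f(*c) != 1:
            continue                     # only minterms of f contribute
        ones = [i + 1 for i in range(n) if c[i] == 1]
        zeros = [i + 1 for i in range(n) if c[i] == 0]
        # Replacing each negated variable by (x_i XOR 1) and distributing
        # yields one monomial for every subset of the negated variables.
        for k in range(len(zeros) + 1):
            for extra in combinations(zeros, k):
                monomial = frozenset(ones) | frozenset(extra)
                terms ^= {monomial}      # a repeated monomial cancels
    return terms

def eval_rse(terms, x):
    """Evaluate the expansion at x = (x_1, ..., x_n)."""
    return sum(all(x[i - 1] for i in monomial) for monomial in terms) % 2

def example(x1, x2, x3):
    """The function (NOT x1 OR x2) AND x3."""
    return ((1 - x1) | x2) & x3

example_rse = rse(example, 3)
or4_rse = rse(lambda *bits: int(any(bits)), 4)
```

For the example function the result consists of the three monomials $x_3$, $x_1 x_3$, and $x_1 x_2 x_3$, and the expansion of the OR of four variables has 15 monomials, every product except the constant 1.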

2.3.5  Comparison  of  Normal  Forms 

It is easy to show that the RSE of a Boolean function is unique (see Problem 2.7). However, the
RSE is not necessarily a compact representation of a function. For example, the RSE of the OR
of $n$ variables, $f_\vee^{(n)}$, includes every product term except for the constant 1. (See Problem 2.9.)

It is also true that some functions have large size in some normal forms but small size in
others. For example, the parity function has exponential size in the DNF and CNF normal
forms but linear size in the RSE. Also, $f_\vee^{(n)}$ has exponential size in the RSE but linear size in
the CNF and SOPE representations.

A natural question to ask is whether there is a function that has large size in all five normal
forms. The answer is yes. This is true of the Boolean function on $n$ variables whose value is 1
when the sum of its variables is 0 modulo 3 and is 0 otherwise. It has exponential-size DNF,
CNF, and RSE normal forms. (See Problem 2.10.) However, its smallest circuit is linear in $n$.
(See Section 2.11.)
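For small $n$ the growth of the three expansions of the modulo-3 function can be tabulated. This sketch counts minterms and maxterms from the truth table and obtains the RSE coefficients with an in-place XOR transform over the table, a standard technique substituted here for the DNF-based construction; the names `mod3` and `sizes` are hypothetical.

```python
def mod3(bits):
    """1 when the number of ones is 0 modulo 3."""
    return int(sum(bits) % 3 == 0)

def sizes(n):
    """Return (number of minterms, number of maxterms, number of RSE products)."""
    table = [mod3([(p >> i) & 1 for i in range(n)]) for p in range(2 ** n)]
    minterm_count = sum(table)
    maxterm_count = 2 ** n - minterm_count
    coeff = table[:]
    for i in range(n):                   # fold in one variable at a time
        for p in range(2 ** n):
            if (p >> i) & 1:
                coeff[p] ^= coeff[p ^ (1 << i)]
    return minterm_count, maxterm_count, sum(coeff)

growth = [sizes(n) for n in range(1, 11)]
```

Printing `growth` shows the counts for $n = 1, \ldots, 10$; this is only numerical evidence for small $n$, not a proof of the exponential bounds.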

2.4  Reductions  Between  Functions 

A  common  way  to  solve  a  new  problem  is  to  apply  an  existing  solution  to  it.  For  example,  an 
integer  multiplication  algorithm  can  be  used  to  square  an  integer  by  supplying  two  copies  of 
the  integer  to  the  multiplier.  This  idea  is  called  a  “reduction”  in  complexity  theory  because  we 
reduce  one  problem  to  a  previously  solved  problem,  here  squaring  to  integer  multiplication.  In 
this  section  we  briefly  discuss  several  simple  forms  of  reduction,  including  subfunctions.  Note 
that  the  definitions  given  below  are  not  limited  to  binary  functions. 

DEFINITION 2.4.1 A function $f : \mathcal{A}^n \mapsto \mathcal{A}^m$ is a reduction to the function $g : \mathcal{A}^r \mapsto \mathcal{A}^s$
through application of the functions $p : \mathcal{A}^s \mapsto \mathcal{A}^m$ and $q : \mathcal{A}^n \mapsto \mathcal{A}^r$ if for all $x \in \mathcal{A}^n$:

$$f(x) = p(g(q(x)))$$

As suggested in Fig. 2.4, it follows that circuits for $q$, $g$ and $p$ can be cascaded (the output
of one is the input to the next) to form a circuit for $f$. Thus, the circuit size and depth of $f$,
$C(f)$ and $D(f)$, satisfy the following inequalities:

$$C(f) \leq C(p) + C(g) + C(q)$$
$$D(f) \leq D(p) + D(g) + D(q)$$
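The squaring example fits this definition directly. In the sketch below, which uses Python integers rather than circuits, `q` duplicates the input, `g` is the previously solved multiplication problem, and `p` is the identity; the function names mirror the symbols in the definition and are otherwise arbitrary.

```python
def q(x):
    """Prepare the input to g: supply two copies of the integer."""
    return (x, x)

def g(pair):
    """The previously solved problem: integer multiplication."""
    a, b = pair
    return a * b

def p(z):
    """Post-process the output of g; for squaring nothing needs to change."""
    return z

def f(x):
    """Squaring, expressed as the reduction f(x) = p(g(q(x)))."""
    return p(g(q(x)))
```

Since `p` and `q` need no logic gates, the inequalities say that squaring is no harder than multiplication in both size and depth.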

A  special  case  of  a  reduction  is  the  subfunction,  as  defined  below. 

DEFINITION 2.4.2 Let $g : \mathcal{A}^n \mapsto \mathcal{A}^m$. A subfunction $f$ of $g$ is a function obtained by assigning
values to some of the input variables of $g$, assigning (not necessarily unique) variable names to the
rest, deleting and/or permuting some of its output variables. We say that $f$ is a reduction to $g$ via
the subfunction relationship.

Figure 2.4 The function $f$ is reduced to the function $g$ by applying functions $p$ and $q$ to prepare
the input to $g$ and manipulate its output.

Figure 2.5 The subfunction $f$ of the function $g$ is obtained by fixing some input variables,
assigning names to the rest, and deleting and/or permuting outputs.

This definition is illustrated by the function $f_{\text{example}}^{(3,2)}(x_1, x_2, x_3) = (y_1, y_2)$ in Fig. 2.2.
We form the subfunction $y_1$ by deleting $y_2$ from $f_{\text{example}}^{(3,2)}$ and setting $x_1 = a$, $x_2 = 1$, and
$x_3 = b$, where $a$ and $b$ are new variables. Then, consulting (2.3), we see that $y_1$ can be written
as follows:

$y_1 = (\bar{a}\,\bar{b}) \vee (a\,b) \vee (\bar{1}\,\bar{b})$
$= \bar{a}\,\bar{b} \vee a\,b$
$= a \oplus b \oplus 1$

That is, $y_1$ contains the complement of the EXCLUSIVE OR function as a subfunction. The
definition is also illustrated by the reductions developed in Sections 2.5.2, 2.5.6, 2.9.5, and
2.10.1.
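The subfunction example can be verified by enumeration. This sketch takes (2.3) to be the expansion $\bar{x}_1 \bar{x}_3 \vee x_1 x_3 \vee \bar{x}_2 \bar{x}_3$, which is an assumption about the intended placement of complements, and fixes $x_2 = 1$ as in the text; the name `y1_sub` is illustrative.

```python
from itertools import product

def f(x1, x2, x3):
    """The expansion assumed for (2.3)."""
    n1, n2, n3 = 1 - x1, 1 - x2, 1 - x3
    return (n1 & n3) | (x1 & x3) | (n2 & n3)

def y1_sub(a, b):
    """Subfunction obtained by fixing x2 = 1 and renaming x1 to a and x3 to b."""
    return f(a, 1, b)

is_complement_of_xor = all(
    y1_sub(a, b) == a ^ b ^ 1 for a, b in product((0, 1), repeat=2)
)
```

Fixing $x_2 = 1$ removes the third product, leaving a function that is 1 exactly when $a$ and $b$ agree.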

The  subfunction  definition  derives  its  importance  from  the  following  lemma.  (See  Fig.  2.5.) 


LEMMA 2.4.1 If $f$ is a subfunction of $g$, a straight-line program for $f$ can be created from one
for $g$ without increasing the size or depth of its circuit.

As  shown  in  Section  2.9.5,  the  logical  shifting  function  (Section  2.5.1)  can  be  realized 
by  composing  the  integer  multiplication  and  decoder  functions  (Section  2.5).  This  type  of 
reduction  is  useful  in  those  cases  in  which  one  function  is  reduced  to  another  with  the  aid  of 
functions  whose  complexity  (size  or  depth  or  both)  is  known  to  be  small  relative  to  that  of 
either  function.  It  follows  that  the  two  functions  have  the  same  asymptotic  complexity  even  if 
we  cannot  determine  what  that  complexity  is.  The  reduction  is  a  powerful  idea  that  is  widely 
used  in  computer  science.  Not  only  is  it  the  essence  of  the  subroutine,  but  it  is  also  used  to 
classify  problems  by  their  time  or  space  complexity.  (See  Sections  3.9.3  and  8.7.) 

2.5  Specialized  Circuits 

A  small  number  of  special  functions  arise  repeatedly  in  the  design  of  computers.  These  include 
logical  and  shifting  operations,  encoders,  decoders,  multiplexers,  and  demultiplexers.  In  the 
following sections we construct efficient circuits for these functions.

Figure 2.6 A balanced binary tree circuit that combines elements with an associative operator.


2.5.1  Logical  Operations 

Logical operations are not only building blocks for more complex operations, but they are
at the heart of all central processing units. Logical operations include “vector” and “associating”
operations. A vector operation is the component-wise operation on one or more
vectors. For example, the vector NOT on the vector $x = (x_{n-1}, \ldots, x_1, x_0)$ is the vector
$\bar{x} = (\bar{x}_{n-1}, \ldots, \bar{x}_1, \bar{x}_0)$. Other vector operations involve the application of a two-input function
to corresponding components of two vectors. If $\star$ is a two-input function, such as AND
or OR, and $x = (x_{n-1}, \ldots, x_1, x_0)$ and $y = (y_{n-1}, \ldots, y_1, y_0)$ are two $n$-tuples, the vector
operation $x \star y$ is

$$x \star y = (x_{n-1} \star y_{n-1}, \ldots, x_1 \star y_1, x_0 \star y_0)$$

An associative operator $\odot$ over a set $A$ satisfies the condition $(a \odot b) \odot c = a \odot (b \odot c)$ for all
$a, b, c \in A$. A summing operation on an $n$-tuple $x$ with an associative two-input operation
$\odot$ produces the “sum” $y$ defined below.

$$y = x_{n-1} \odot \cdots \odot x_1 \odot x_0$$

An efficient circuit for computing $y$ is shown in Fig. 2.6. It is a binary tree whose leaves are
associated with the variables $x_{n-1}, \ldots, x_1, x_0$. Each level of the tree is full except possibly
the last. This circuit has smallest depth of those that form the associative combination of the
variables, namely $\lceil \log_2 n \rceil$.
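The balanced-tree idea can be sketched in software by combining neighbors level by level. The code below is an illustration rather than a circuit description: it assumes the operator is passed as a two-argument Python function, and the name `tree_combine` is hypothetical. Operand order is preserved, so only associativity is required.

```python
from functools import reduce
from operator import xor

def tree_combine(values, op):
    """Combine values with an associative op in a balanced binary tree.
    Returns (result, depth), where depth counts levels of two-input gates."""
    level = list(values)
    depth = 0
    while len(level) > 1:
        # Pair up neighbors; each pair costs one gate at this level.
        nxt = [op(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2 == 1:
            nxt.append(level[-1])        # an unpaired element passes up unchanged
        level = nxt
        depth += 1
    return level[0], depth

bits = [1, 0, 1, 1, 0, 1, 0, 1]
total, depth = tree_combine(bits, xor)
```

For these eight inputs the tree has three levels, whereas a left-to-right chain over the same inputs would have depth seven.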

2.5.2  Shifting  Functions 

Shifting  functions  can  be  used  to  multiply  integers  and  generally  manipulate  data.  A  cyclic 
shifting  function  rotates  the  bits  in  a  word.  For  example,  the  left  cyclic  shift  of  the  4-tuple 
(1,  0,  0,  0)  by  three  places  produces  the  4-tuple  (0,  1, 0,  0). 

The cyclic shifting function $f_{\text{cyclic}}^{(n)} : \mathcal{B}^{n + \lceil \log_2 n \rceil} \mapsto \mathcal{B}^n$ takes as input an $n$-tuple $x =
(x_{n-1}, \ldots, x_1, x_0)$ and cyclically shifts it left by $|s|$ places, where $|s|$ is the integer associated
with the binary $k$-tuple $s = (s_{k-1}, \ldots, s_1, s_0)$, $k = \lceil \log_2 n \rceil$, and

$$|s| = \sum_{j=0}^{k-1} s_j 2^j$$

The $n$-tuple that results from the shift is $y = (y_{n-1}, \ldots, y_1, y_0)$, denoted as follows:

$$y = f_{\text{cyclic}}^{(n)}(x, s)$$

Figure 2.7 Three stages of a cyclic shifting circuit on eight inputs.


A convenient way to perform the cyclic shift of $x$ by $|s|$ places is to represent $|s|$ as a sum
of powers of 2, as shown above, and for each $0 \leq j \leq k - 1$, shift $x$ left cyclically by $s_j 2^j$
places, that is, by either 0 or $2^j$ places depending on whether $s_j = 0$ or 1. For example,
consider cyclically shifting the 8-tuple $u = (1, 0, 1, 1, 0, 1, 0, 1)$ by seven places. Since 7 is
represented by the binary number $(1, 1, 1)$, that is, $7 = 4 + 2 + 1$, to shift $(1, 0, 1, 1, 0, 1, 0, 1)$
by seven places it suffices to shift it by one place, by two places, and then by four places. (See
Fig. 2.7.)

For $0 \leq r \leq n - 1$, the following formula defines the value of the $r$th output, $y_r$, of a
circuit on $n$ inputs that shifts its input $x$ left cyclically by either 0 or $2^j$ places depending on
whether $s_j = 0$ or 1:

$$y_r = (x_r \wedge \bar{s}_j) \vee (x_{(r - 2^j) \bmod n} \wedge s_j)$$

Thus, $y_r$ is $x_r$ in the first case or $x_{(r - 2^j) \bmod n}$ in the second. The subscript $(r - 2^j) \bmod n$
is the positive remainder of $(r - 2^j)$ after division by $n$. For example, if $n = 4$, $r = 1$, and
$j = 1$, then $(r - 2^j) = -1$, which is 3 modulo 4. That is, in a circuit that shifts by either 0
or $2^1$ places, $y_1$ is either $x_1$ or $x_3$ because $x_3$ moves into the second position when shifted left
cyclically by two places.

A circuit based on the above formula that shifts by either 0 or $2^j$ places depending on
whether $s_j = 0$ or 1 is shown in Fig. 2.8 for $n = 4$. The circuit on $n$ inputs has $3n + 1$ gates
and depth 3.

It follows that a circuit for cyclic shifting an $n$-tuple can be realized in $k = \lceil \log_2 n \rceil$ stages
each of which has $3n + 1$ gates and depth 3, as suggested by Fig. 2.7. Since this may be neither
the smallest nor the shallowest circuit that computes $f_{\text{cyclic}}^{(n)} : \mathcal{B}^{n + \lceil \log_2 n \rceil} \mapsto \mathcal{B}^n$, its minimal circuit
size and depth satisfy the following bounds.


Figure 2.8 One stage of a circuit for cyclic shifting four inputs by 0 or 2 places depending on
whether $s_1 = 0$ or 1.

LEMMA 2.5.1 The cyclic shifting function $f_{\text{cyclic}}^{(n)} : \mathcal{B}^{n + \lceil \log_2 n \rceil} \mapsto \mathcal{B}^n$ can be realized by a
circuit of the following size and depth over the basis $\Omega_0 = \{\wedge, \vee, \neg\}$:

$$C_{\Omega_0}\left(f_{\text{cyclic}}^{(n)}\right) \leq (3n + 1) \lceil \log_2 n \rceil$$
$$D_{\Omega_0}\left(f_{\text{cyclic}}^{(n)}\right) \leq 3 \lceil \log_2 n \rceil$$
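The staged construction can be simulated bit by bit. In this sketch a tuple is stored as a Python list indexed by subscript, so `x[r]` holds $x_r$ and `s[j]` holds $s_j$ (the reverse of the left-to-right order used when tuples are written out); the function names are illustrative, and each stage applies the per-output formula $y_r = (x_r \wedge \bar{s}_j) \vee (x_{(r - 2^j) \bmod n} \wedge s_j)$.

```python
def shift_stage(x, j, sj):
    """One stage: shift left cyclically by 0 places if sj == 0, by 2**j if sj == 1."""
    n = len(x)
    return [(x[r] & (1 - sj)) | (x[(r - 2 ** j) % n] & sj) for r in range(n)]

def cyclic_shift(x, s):
    """Apply one stage per bit of s; the total shift is |s| = sum of s[j] * 2**j."""
    for j, sj in enumerate(s):
        x = shift_stage(x, j, sj)
    return x

# The 4-tuple (1, 0, 0, 0) written high-order first is [0, 0, 0, 1] by subscript.
four = cyclic_shift([0, 0, 0, 1], [1, 1])           # shift by three places

# The 8-tuple u = (1, 0, 1, 1, 0, 1, 0, 1), stored by subscript, shifted by seven.
u = [1, 0, 1, 0, 1, 1, 0, 1]
shifted = cyclic_shift(u, [1, 1, 1])
```

Python's `%` operator already returns the nonnegative remainder, so no special handling of negative subscripts is needed.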

The logical shifting function $f_{\text{shift}}^{(n)} : \mathcal{B}^{n + \lceil \log_2 n \rceil} \mapsto \mathcal{B}^n$ shifts left the $n$-tuple $x$ by
a number of places specified by a binary $\lceil \log_2 n \rceil$-tuple $s$, discarding the higher-index components,
and filling in the lower-indexed vacated places with 0’s to produce the $n$-tuple $y$,
where

$$y_j = \begin{cases} x_{j - |s|} & \text{for } |s| \leq j \leq n - 1 \\ 0 & \text{otherwise} \end{cases}$$

REDUCTIONS BETWEEN LOGICAL AND CYCLIC SHIFTING The logical shifting function $f_{\text{shift}}^{(n)} :
\mathcal{B}^{n + \lceil \log_2 n \rceil} \mapsto \mathcal{B}^n$ on the $n$-tuple $x$ is defined below in terms of $f_{\text{cyclic}}^{(2n)}$ and the “projection”
function $\pi_L^{(n)} : \mathcal{B}^{2n} \mapsto \mathcal{B}^n$ that deletes the $n$ high order components from its input $2n$-tuple.
Here $0$ denotes the zero binary $n$-tuple and $0 \cdot x$ denotes the concatenation of the two strings.
(See Figs. 2.9 and 2.10.)

$$f_{\text{shift}}^{(n)}(x, s) = \pi_L^{(n)}\left(f_{\text{cyclic}}^{(2n)}(0 \cdot x, s)\right)$$

Figure 2.9 The reduction of $f_{\text{shift}}^{(n)}$ to $f_{\text{cyclic}}^{(2n)}$ obtained by cyclically shifting $0 \cdot x$ by three places
and projecting out the shaded components.

Figure 2.10 The function $f_{\text{cyclic}}^{(n)}$ is obtained by computing $f_{\text{shift}}^{(2n)}$ on $xx$ and truncating the $n$
low-order bits.

LEMMA 2.5.2 The function $f_{\text{cyclic}}^{(2n)}$ contains $f_{\text{shift}}^{(n)}$ as a subfunction and the function $f_{\text{shift}}^{(2n)}$ contains
$f_{\text{cyclic}}^{(n)}$ as a subfunction.

Proof The first statement follows from the above argument concerning $f_{\text{shift}}^{(n)}$. The second
statement follows by noting that

$$f_{\text{cyclic}}^{(n)}(x, s) = \pi_H^{(n)}\left(f_{\text{shift}}^{(2n)}(x \cdot x, s)\right)$$

where $\pi_H^{(n)}$ deletes the $n$ low-order components of its input. ■


This  relationship  between  logical  and  cyclic  shifting  functions  clearly  holds  for  variants 
of  such  functions  in  which  the  amount  of  a  shift  is  specified  with  some  other  notation.  An 
example  of  such  a  shifting  function  is  integer  multiplication  in  which  one  of  the  two  arguments 
is  a  power  of  2. 
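Both reductions can be exercised on small inputs. The sketch below makes two simplifying assumptions: tuples are lists indexed by subscript (so index 0 is the low-order component), and the shift amount is passed as an integer `k` with $0 \leq k < n$ rather than as a binary tuple, which is one of the notational variants mentioned above. All function names are hypothetical.

```python
def cyclic(x, k):
    """Left cyclic shift: y_r = x_{(r - k) mod n}."""
    n = len(x)
    return [x[(r - k) % n] for r in range(n)]

def logical(x, k):
    """Left logical shift: y_r = x_{r - k} for r >= k and 0 otherwise."""
    n = len(x)
    return [x[r - k] if r >= k else 0 for r in range(n)]

def logical_via_cyclic(x, k):
    """Cyclically shift the 2n-tuple 0.x, then keep the n low-order components."""
    n = len(x)
    wide = x + [0] * n           # zeros occupy the n high-order positions
    return cyclic(wide, k)[:n]

def cyclic_via_logical(x, k):
    """Logically shift the 2n-tuple x.x, then keep the n high-order components."""
    n = len(x)
    wide = x + x
    return logical(wide, k)[n:]

x = [1, 0, 1, 0, 1, 1, 0, 1]
checks = [
    (logical_via_cyclic(x, k) == logical(x, k), cyclic_via_logical(x, k) == cyclic(x, k))
    for k in range(len(x))
]
```

In the first reduction the zeros that wrap around fill the vacated low-order places; in the second, the low copy of $x$ supplies the bits that a cyclic shift would have wrapped around.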


2.5.3  Encoder 

The encoder function $f_{\text{encode}}^{(n)} : \mathcal{B}^{2^n} \mapsto \mathcal{B}^n$ has $2^n$ inputs, exactly one of which is 1. Its
output is an $n$-tuple that is a binary number representing the position of the input that has
value 1. That is, it encodes the position of the input bit that has value 1. Encoders are used in
CPUs to identify the source of external interrupts.

Let $x = (x_{2^n - 1}, \ldots, x_2, x_1, x_0)$ represent the $2^n$ inputs and let $y = (y_{n-1}, \ldots, y_1, y_0)$
represent the $n$ outputs. Then, we write $f_{\text{encode}}^{(n)}(x) = y$.

When $n = 1$, the encoder function has two inputs, $x_1$ and $x_0$, and one output, $y_0$, whose
value is $y_0 = x_1$ because if $x_0 = 1$, then $x_1 = 0$ and $y_0 = 0$ is the binary representation of
the input whose value is 1. Similar reasoning applies when $x_0 = 0$.

When $n \geq 2$, we observe that the high-order output bit, $y_{n-1}$, has value 1 if 1 falls among
the variables $x_{2^n - 1}, \ldots, x_{2^{n-1} + 1}, x_{2^{n-1}}$. Otherwise, $y_{n-1} = 0$. Thus, $y_{n-1}$ can be computed
as the OR of these variables, as suggested for the encoder on eight inputs in Fig. 2.11.

The remaining $n - 1$ output bits, $y_{n-2}, \ldots, y_1, y_0$, represent the position of the 1 among
variables $x_{2^{n-1} - 1}, \ldots, x_2, x_1, x_0$ if $y_{n-1} = 0$ or the 1 among variables $x_{2^n - 1}, \ldots, x_{2^{n-1} + 1},
x_{2^{n-1}}$ if $y_{n-1} = 1$. For example, for $n = 3$ if $x = (0, 0, 0, 0, 0, 0, 1, 0)$, then $y_2 = 0$ and
$(y_1, y_0) = (0, 1)$, whereas if $x = (0, 0, 1, 0, 0, 0, 0, 0)$, then $y_2 = 1$ and $(y_1, y_0) = (0, 1)$.
Thus, after computing $y_{n-1}$ as the OR of the $2^{n-1}$ high-order inputs, the remaining output
bits can be obtained by supplying to an encoder on $2^{n-1}$ inputs the $2^{n-1}$ low-order bits if
$y_{n-1} = 0$ or the $2^{n-1}$ high-order bits if $y_{n-1} = 1$. It follows that in both cases we can
supply the vector $\delta = (x_{2^n - 1} \vee x_{2^{n-1} - 1}, x_{2^n - 2} \vee x_{2^{n-1} - 2}, \ldots, x_{2^{n-1}} \vee x_0)$ of $2^{n-1}$
components to the encoder on $2^{n-1}$ inputs. This is illustrated in Fig. 2.11.

Figure 2.11 The recursive construction of an encoder circuit on eight inputs.

Let’s now derive upper bounds on the size and depth of the optimal circuit for $f_{\text{encode}}^{(n)}$.
Clearly $C_{\Omega_0}\left(f_{\text{encode}}^{(1)}\right) = 0$ and $D_{\Omega_0}\left(f_{\text{encode}}^{(1)}\right) = 0$, since no gates are needed in this case.
From the construction described above and illustrated in Fig. 2.11, we see that we can construct
a circuit for $f_{\text{encode}}^{(n)}$ in a two-step process. First, we form $y_{n-1}$ as the OR of the $2^{n-1}$ high-order
variables in a balanced OR tree of depth $n - 1$ using $2^{n-1} - 1$ ORs. Second, we form
the vector $\delta$ with a circuit of depth 1 using $2^{n-1}$ ORs and supply it to a copy of a circuit
for $f_{\text{encode}}^{(n-1)}$. This provides the following recurrences for the circuit size and depth of $f_{\text{encode}}^{(n)}$
because the depth of this circuit is no more than the maximum of the depth of the OR tree and
1 more than the depth of a circuit for $f_{\text{encode}}^{(n-1)}$:

$$C_{\Omega_0}\left(f_{\text{encode}}^{(n)}\right) \leq 2^n - 1 + C_{\Omega_0}\left(f_{\text{encode}}^{(n-1)}\right) \qquad (2.4)$$
$$D_{\Omega_0}\left(f_{\text{encode}}^{(n)}\right) \leq \max\left(n - 1,\; D_{\Omega_0}\left(f_{\text{encode}}^{(n-1)}\right) + 1\right) \qquad (2.5)$$

The  solutions  to  these  recurrences  are  stated  as  the  following  lemma,  as  the  reader  can  show. 
(See  Problem  2.14.) 

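LEMMA 2.5.3 The encoder function $f_{\text{encode}}^{(n)}$ has the following circuit size and depth bounds:

$$C_{\Omega_0}\left(f_{\text{encode}}^{(n)}\right) \leq 2^{n+1} - (n + 3)$$
$$D_{\Omega_0}\left(f_{\text{encode}}^{(n)}\right) \leq n - 1$$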
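The recursive encoder and its gate count can be simulated as follows. This is a sketch under stated assumptions: inputs are lists indexed by subscript, outputs are returned low-order bit first, and the gate tally simply adds the two-input ORs used by the construction (an OR tree on the high half plus one OR per component of the merged vector). The name `encode` is illustrative.

```python
def encode(x):
    """x is a list of 2**n bits (x[i] is x_i), exactly one of which is 1.
    Returns (y, gates) with y = [y_0, ..., y_{n-1}] and gates = ORs used."""
    if len(x) == 2:
        return [x[1]], 0                         # n = 1: y_0 = x_1, no gates
    half = len(x) // 2
    low, high = x[:half], x[half:]
    y_top = int(any(high))                       # OR tree with half - 1 gates
    merged = [h | l for h, l in zip(high, low)]  # half ORs in one level
    y_rest, gates = encode(merged)
    return y_rest + [y_top], gates + (half - 1) + half

one_hot = [0] * 8
one_hot[5] = 1
y, gates = encode(one_hot)
```

For eight inputs with the 1 at position 5 the output is `[1, 0, 1]` (low-order first) using 10 ORs, which equals $2^{4} - (3 + 3)$.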
2.5.4 Decoder

A  decoder  is  a  function  that  reverses  the  operation  of  an  encoder:  given  an  n-bit  binary  address, 
it  generates  2"  bits  with  a  single  1  in  the  position  specified  by  the  binary  number.  Decoders 
are  used  in  the  design  of  random-access  memory  units  (see  Section  3.5)  and  of  the  multiplexer 
(see  Section  2.5.5). 

The decoder function $f_{\text{decode}}^{(n)} : \mathcal{B}^n \mapsto \mathcal{B}^{2^n}$ has $n$ input variables $x = (x_{n-1}, \ldots, x_1, x_0)$
and $2^n$ output variables $y = (y_{2^n - 1}, \ldots, y_1, y_0)$; that is, $f_{\text{decode}}^{(n)}(x) = y$. Let $c$ be a binary
$n$-tuple corresponding to the integer $|c|$. On input $x = c$, all components of the binary $2^n$-tuple $y$ are zero
except for the one whose index is $|c|$, namely $y_{|c|}$. Thus, the minterm functions in the variables
$x$ are computed as the output of $f_{\text{decode}}^{(n)}$.

A direct realization of the function $f_{\text{decode}}^{(n)}$ can be obtained by realizing each minterm
independently. This circuit uses $(2n - 1) 2^n$ gates and has depth $\lceil \log_2 n \rceil + 1$. Thus we have
the following bounds over the basis $\Omega_0 = \{\wedge, \vee, \neg\}$:

$$C_{\Omega_0}\left(f_{\text{decode}}^{(n)}\right) \leq (2n - 1) 2^n$$
$$D_{\Omega_0}\left(f_{\text{decode}}^{(n)}\right) \leq \lceil \log_2 n \rceil + 1$$

A smaller upper bound on circuit size and depth can be obtained from the recursive construction
of Fig. 2.12, which is based on the observation that a minterm on $n$ variables is the
AND of a minterm on the first $n/2$ variables and a minterm on the second $n/2$ variables. For
example, when $n = 4$, the minterm $x_3 \wedge \bar{x}_2 \wedge x_1 \wedge \bar{x}_0$ is obviously equal to the AND of the
minterm $x_3 \wedge \bar{x}_2$ in the variables $x_3$ and $x_2$ and the minterm $x_1 \wedge \bar{x}_0$ in the variables $x_1$ and $x_0$.
Thus, when $n$ is even, the minterms that are the outputs of $f_{\text{decode}}^{(n)}$ can be formed by ANDing
every minterm generated by a circuit for $f_{\text{decode}}^{(n/2)}$ on the variables $x_{n/2 - 1}, \ldots, x_0$ with every
minterm generated by a circuit for $f_{\text{decode}}^{(n/2)}$ on the variables $x_{n-1}, \ldots, x_{n/2}$, as suggested in
Fig. 2.12.

Figure 2.12 The construction of a decoder on four inputs from two copies of a decoder on two
inputs.

The new circuit for $f_{\text{decode}}^{(n)}$ has a size that is at most twice that of a circuit for $f_{\text{decode}}^{(n/2)}$
plus $2^n$ for the AND gates that combine minterms. It has a depth that is at most 1 more than
the depth of a circuit for $f_{\text{decode}}^{(n/2)}$. Thus, when $n$ is even we have the following bounds on the
circuit size and depth of $f_{\text{decode}}^{(n)}$:

$$C_{\Omega_0}\left(f_{\text{decode}}^{(n)}\right) \leq 2\, C_{\Omega_0}\left(f_{\text{decode}}^{(n/2)}\right) + 2^n$$
$$D_{\Omega_0}\left(f_{\text{decode}}^{(n)}\right) \leq D_{\Omega_0}\left(f_{\text{decode}}^{(n/2)}\right) + 1$$

Specializing the first bounds given above on the size and depth of a decoder circuit to one on
$n/2$ inputs, we have the bound in Lemma 2.5.4. Furthermore, since the output functions are
all different, $C_{\Omega_0}\left(f_{\text{decode}}^{(n)}\right)$ is at least $2^n$.


LEMMA 2.5.4 For $n$ even the decoder function $f_{\text{decode}}^{(n)}$ has the following circuit size and depth
bounds:

$$2^n \leq C_{\Omega_0}\left(f_{\text{decode}}^{(n)}\right) \leq 2^n + (2n - 2) 2^{n/2}$$
$$D_{\Omega_0}\left(f_{\text{decode}}^{(n)}\right) \leq \lceil \log_2 n \rceil + 1$$

The circuit size bound is linear in the number of outputs. Also, for $n \geq 12$, the exact value of
$C_{\Omega_0}\left(f_{\text{decode}}^{(n)}\right)$ is known to within 25%. Since each output depends on $n$ inputs, we will see
in Chapter 9 that the upper bound on depth is exactly the depth of the smallest depth circuit
for the decoder function.
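The two-level decoder construction and its size bound can be checked for a small even $n$. In this sketch inputs are lists indexed by subscript, the gate count of the direct decoder is taken from the $(2m - 1) 2^m$ figure quoted in the text rather than derived from a gate-level netlist, and the names `direct_decoder` and `split_decoder` are hypothetical.

```python
def direct_decoder(x):
    """Realize every minterm independently. x[i] is x_i.
    Returns (y, gates) where y[c] = 1 exactly when c == |x|."""
    m = len(x)
    y = [int(all(x[i] == (c >> i) & 1 for i in range(m))) for c in range(2 ** m)]
    return y, (2 * m - 1) * 2 ** m       # gate count as stated in the text

def split_decoder(x):
    """For even n: AND each minterm on the low half with each on the high half."""
    n = len(x)
    h = n // 2
    lo, gates_lo = direct_decoder(x[:h])     # variables x_{n/2-1}, ..., x_0
    hi, gates_hi = direct_decoder(x[h:])     # variables x_{n-1}, ..., x_{n/2}
    y = [hi[c >> h] & lo[c & (2 ** h - 1)] for c in range(2 ** n)]
    return y, gates_lo + gates_hi + 2 ** n   # 2**n ANDs combine the halves

address = [1, 0, 1, 1]                       # |x| = 1 + 4 + 8 = 13
outputs, gates = split_decoder(address)
```

For $n = 4$ the tally is 40 gates, equal to $2^4 + (2 \cdot 4 - 2) 2^{2}$, compared with $(2 \cdot 4 - 1) 2^4 = 112$ for the direct realization.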


2.5.5  Multiplexer 

The multiplexer function $f_{\text{mux}}^{(n)} : \mathcal{B}^{2^n + n} \mapsto \mathcal{B}$ has two vector inputs, $z = (z_{2^n - 1}, \ldots, z_1,
z_0)$ and $x = (x_{n-1}, \ldots, x_1, x_0)$, where $x$ is treated as an address. The output of $f_{\text{mux}}^{(n)}$ is
$v = z_j$, where $j = |x|$ is the integer represented by the binary number $x$. This function is
also known as the storage access function because it simulates the access to storage made by a
random-access memory with one-bit words. (See Section 3.5.)

The similarity between this function and the decoder function should be apparent. The
decoder function has $n$ inputs, $x = (x_{n-1}, \ldots, x_1, x_0)$, and $2^n$ outputs, $y = (y_{2^n - 1}, \ldots, y_1,
y_0)$, where $y_j = 1$ if $j = |x|$ and $y_j = 0$ otherwise. Thus, we can form $v = z_j$ as

$$v = (z_{2^n - 1} \wedge y_{2^n - 1}) \vee \cdots \vee (z_1 \wedge y_1) \vee (z_0 \wedge y_0)$$

This circuit uses a circuit for the decoder function $f_{\text{decode}}^{(n)}$ plus $2^n$ AND gates and $2^n - 1$
OR gates. It adds a depth of $n + 1$ to the depth of a decoder circuit. Lemma 2.5.5 follows
immediately from these observations.

LEMMA 2.5.5 The multiplexer function $f_{\text{mux}}^{(n)} : \mathcal{B}^{2^n + n} \mapsto \mathcal{B}$ can be realized with the following
circuit size and depth over the basis $\Omega_0 = \{\wedge, \vee, \neg\}$:

$$C_{\Omega_0}\left(f_{\text{mux}}^{(n)}\right) \leq 3 \cdot 2^n + 2(n - 1) 2^{n/2} - 1$$
$$D_{\Omega_0}\left(f_{\text{mux}}^{(n)}\right) \leq n + \lceil \log_2 n \rceil + 2$$

Using the lower bound of Theorem 9.3.3, one can show that it is impossible to reduce the upper bound on circuit size to less than $2^{n+1} - 2$. At the cost of increasing the depth by 1, the circuit size bound can be improved to about $2^{n+1}$. (See Problem 2.15.) Since $f^{(n)}_{\text{mux}}$ depends on $2^n + n$ variables, we see from Theorem 9.3.1 that it must have depth at least $\log_2(2^n + n) \ge n$. Thus, the above depth bound is very tight.
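The construction of the multiplexer from a decoder can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `decoder` and `multiplexer` are ours, the decoder is built naively from minterms rather than by the recursive construction behind Lemma 2.5.4, and address bits are listed most significant first.

```python
def decoder(x):
    """Return decoder outputs [y_0, ..., y_{2**n - 1}] for address bits x.

    x lists the address most significant bit first: (x_{n-1}, ..., x_0).
    """
    n = len(x)
    outputs = []
    for j in range(2 ** n):
        term = 1
        for position, bit in enumerate(x):
            wanted = (j >> (n - 1 - position)) & 1
            # Use the literal x_i when bit i of j is 1, else its complement.
            term &= bit if wanted else 1 - bit
        outputs.append(term)
    return outputs


def multiplexer(z, x):
    """Return v = OR over j of (z_j AND y_j), where z[j] holds z_j."""
    v = 0
    for z_j, y_j in zip(z, decoder(x)):
        v |= z_j & y_j
    return v


# Address 101 (binary) = 5 selects z_5.
z = [0, 0, 0, 0, 0, 1, 0, 0]
print(multiplexer(z, (1, 0, 1)))  # prints 1
```

The loop in `multiplexer` mirrors the $2^n$ AND gates and the chain of OR gates counted in Lemma 2.5.5; a circuit would arrange the ORs as a balanced tree to obtain the stated depth.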

2.5.6  Demultiplexer 

The demultiplexer function $f^{(n)}_{\text{demux}} : B^{n+1} \mapsto B^{2^n}$ is very similar to a decoder. It has $n + 1$ inputs consisting of $n$ bits, $\mathbf{x}$, that serve as an address and a data bit $e$. It has $2^n$ outputs $\mathbf{y}$ all of which are 0 if $e = 0$ and one output that is 1 if $e = 1$, namely the output specified by the $n$ address bits. Demultiplexers are used to route a data bit ($e$) to one of $2^n$ output positions.

A circuit for the demultiplexer function can be constructed as follows. First, form the AND of $e$ with each of the $n$ address bits $x_{n-1}, \ldots, x_1, x_0$ and supply this new $n$-tuple as input to a decoder circuit. Let $\mathbf{z} = (z_{2^n-1}, \ldots, z_1, z_0)$ be the decoder outputs. When $e = 0$, each of the decoder inputs is 0 and each of the decoder outputs except $z_0$ is 0 and $z_0 = 1$. If we form the AND of $z_0$ with $e$, this new output is also 0 when $e = 0$. If $e = 1$, the decoder input is the address $\mathbf{x}$ and the output that is 1 is in the position specified by this address. Thus, a circuit for a demultiplexer can be constructed from a circuit for $f^{(n)}_{\text{decode}}$ to which are added $n$ AND gates on its input and one on its output. This circuit has a depth that is at most 2 more than the depth of the decoder circuit. Since a circuit for a decoder can be constructed from one for a demultiplexer by fixing $e = 1$, we have the following bounds on the size and depth of a circuit for $f^{(n)}_{\text{demux}}$.

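LEMMA 2.5.6 The demultiplexer function $f^{(n)}_{\text{demux}} : B^{n+1} \mapsto B^{2^n}$ can be realized with the following circuit size and depth over the basis $\Omega_0 = \{\wedge, \vee, \neg\}$:

$$0 \le C_{\Omega_0}\bigl(f^{(n)}_{\text{demux}}\bigr) - C_{\Omega_0}\bigl(f^{(n)}_{\text{decode}}\bigr) \le n + 1$$

$$0 \le D_{\Omega_0}\bigl(f^{(n)}_{\text{demux}}\bigr) - D_{\Omega_0}\bigl(f^{(n)}_{\text{decode}}\bigr) \le 2$$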
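A small Python sketch may help make the demultiplexer construction concrete. It assumes a behavioral model of the decoder (a one-hot list) instead of a gate-level one, and the names `decoder` and `demultiplexer` are illustrative only.

```python
def decoder(x):
    """Behavioral decoder: one-hot list with y_j = 1 exactly when j == |x|.

    x lists the address most significant bit first.
    """
    address = 0
    for bit in x:
        address = 2 * address + bit
    return [1 if j == address else 0 for j in range(2 ** len(x))]


def demultiplexer(e, x):
    """Route data bit e to output position |x|; all other outputs are 0."""
    gated = [e & bit for bit in x]   # n AND gates on the decoder inputs
    z = decoder(gated)
    z[0] &= e                        # one AND gate on decoder output z_0
    return z


print(demultiplexer(1, (1, 1, 0)))   # one-hot at position 6
print(demultiplexer(0, (1, 1, 0)))   # all zeros
```

Fixing `e = 1` makes `demultiplexer` behave exactly like `decoder`, which is the observation behind the lower bounds of 0 in Lemma 2.5.6.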

2.6  Prefix  Computations 

The prefix computation first appeared in the design of logic circuits, the goal being to parallelize as much as possible circuits for integer addition and multiplication. The carry-lookahead adder is a fast circuit for integer addition that is based on a prefix computation. (See Section 2.7.) Prefix computations are now widely used in parallel computation because they provide a standard, optimizable framework in which to perform computations in parallel.

The prefix function $\mathcal{P}^{(n)}_{\odot} : A^n \mapsto A^n$ on input $\mathbf{x} = (x_1, x_2, \ldots, x_n)$ produces as output $\mathbf{y} = (y_1, y_2, \ldots, y_n)$, which is a running sum of its $n$ inputs $\mathbf{x}$ using the operator $\odot$ as the summing operator. That is, $y_j = x_1 \odot x_2 \odot \cdots \odot x_j$ for $1 \le j \le n$. Thus, if the set $A$ is $\mathbb{N}$, the natural numbers, and $\odot$ is the integer addition operator $+$, then $\mathcal{P}^{(n)}_{+}$ on the input $\mathbf{x} = (x_1, x_2, \ldots, x_n)$ produces the output $\mathbf{y}$, where $y_1 = x_1$, $y_2 = x_1 + x_2$, $y_3 = x_1 + x_2 + x_3$, $\ldots$, $y_n = x_1 + x_2 + \cdots + x_n$. For example, shown below is the prefix function on a 6-vector of integers under integer addition.

$$\mathbf{x} = (2, 1, 3, 7, 5, 1)$$

$$\mathcal{P}^{(6)}_{+}(\mathbf{x}) = (2, 3, 6, 13, 18, 19)$$

A prefix function is defined only for operators $\odot$ that are associative over the set $A$. An operator over $A$ is associative if a) for all $a$ and $b$ in $A$, $a \odot b$ is in $A$, and b) for all $a$, $b$, and $c$ in $A$, $(a \odot b) \odot c = a \odot (b \odot c)$, that is, if all groupings of terms in a sum with the operator $\odot$ have the same value. A pair $(A, \odot)$ in which $\odot$ is associative is called a semigroup. Three semigroups on which a prefix function can be defined are

• $(\mathbb{N}, +)$ where $\mathbb{N}$ are the natural numbers and $+$ is integer addition.

• $(\{0,1\}^*, \cdot)$ where $\{0,1\}^*$ is the set of binary strings and $\cdot$ is string concatenation.

• $(A, \odot_{\text{copy}})$ where $A$ is a set and $\odot_{\text{copy}}$ is defined by $a \odot_{\text{copy}} b = a$.

It is easy to show that the concatenation operator $\cdot$ on $\{0,1\}^*$ and $\odot_{\text{copy}}$ on a set $A$ are associative. (See Problem 2.20.) Another important semigroup is the set of matrices under matrix multiplication (see Theorem 6.2.1).

Summarizing, if $(A, \odot)$ is a semigroup, the prefix function $\mathcal{P}^{(n)}_{\odot} : A^n \mapsto A^n$ on input $\mathbf{x} = (x_1, x_2, \ldots, x_n)$ produces as output $\mathbf{y} = (y_1, y_2, \ldots, y_n)$, where $y_j = x_1 \odot x_2 \odot \cdots \odot x_j$ for $1 \le j \le n$.
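As a minimal sketch of this definition, the prefix function over any semigroup can be written with Python's `itertools.accumulate`, which applies a binary operator from left to right. The helper names `prefix` and `copy_first` are ours; the three calls correspond to the three semigroups listed above.

```python
from itertools import accumulate
import operator


def prefix(xs, op):
    """Return [y_1, ..., y_n] with y_j = x_1 op x_2 op ... op x_j."""
    return list(accumulate(xs, op))


def copy_first(a, b):
    """The copy operator: a copy b = a."""
    return a


print(prefix([2, 1, 3, 7, 5, 1], operator.add))   # [2, 3, 6, 13, 18, 19]
print(prefix(["0", "1", "10"], operator.add))     # concatenation: ['0', '01', '0110']
print(prefix([4, 9, 2], copy_first))              # [4, 4, 4]
```

This evaluates the prefix sequentially with $n - 1$ operator applications; it says nothing yet about how to do the work in parallel.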

Load  balancing  on  a  parallel  machine  is  an  important  application  of  prefix  computation. 
A  simple  example  of  load  balancing  is  the  following:  We  assume  that  p  processors,  numbered 
from  0  to  p  —  1,  are  running  processes  in  parallel.  We  also  assume  that  processes  are  born 
and  die,  resulting  in  a  possible  imbalance  in  the  number  of  processes  active  on  processors. 
Since  it  is  desirable  that  all  processors  be  running  the  same  number  of  processes,  processes 
are  periodically  redistributed  among  processors  to  balance  the  load.  To  rebalance  the  load,  a) 
processors  are  given  a  linear  order,  and  b)  each  process  is  assigned  a  Boolean  variable  with  value 
1 if it is alive and 0 otherwise. Each processor computes its number of living processes, $n_i$. A prefix computation is then done on these values using the linear order among processors. This computation provides the $j$th processor with the sum $n_j + n_{j-1} + \cdots + n_1$, which it uses to give each of its living processes a unique index. The sum $n = n_p + \cdots + n_1$ is then broadcast to all processors. When the processors are in balance all have $\lceil n/p \rceil$ processes except possibly one that has fewer processes. Assigning the $s$th process to processor $(s \bmod p)$ ensures that the load is balanced.
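The rebalancing step can be sketched as follows. This is an illustration only: the function name `rebalance` is hypothetical, processes are indexed from 0 (the text does not fix the starting index), and the prefix sums are computed sequentially rather than by a parallel prefix.

```python
from itertools import accumulate


def rebalance(live_counts):
    """live_counts[j] is the number of living processes on processor j.

    Returns, for each processor, the destination processors of its processes.
    """
    p = len(live_counts)
    inclusive = list(accumulate(live_counts))      # prefix sums of the counts
    plan = []
    for j, count in enumerate(live_counts):
        first_index = inclusive[j] - count         # processes on earlier processors
        # Each living process gets a unique index s and goes to processor s mod p.
        plan.append([s % p for s in range(first_index, first_index + count)])
    return plan


print(rebalance([5, 0, 1, 2]))   # [[0, 1, 2, 3, 0], [], [1], [2, 3]]
```

In this example the eight living processes end up two per processor.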

Another important type of prefix computation is the segmented prefix computation. In this case two $n$-vectors are given, a value vector $\mathbf{x}$ and a flag vector $\boldsymbol{\phi}$. The value of the $i$th entry $y_i$ in the result vector $\mathbf{y}$ is $x_i$ if $\phi_i$ is 1 and otherwise is the associative combination with $\odot$ of $x_i$ and the values between it and the first value $x_j$ to the left of $x_i$ for which the flag $\phi_j = 1$. The first bit of $\boldsymbol{\phi}$ is always 1. An example of a segmented prefix computation is shown below for integer values and integer addition as the associative operation:

$$\mathbf{x} = (2, 1, 3, 7, 5, 1)$$

$$\boldsymbol{\phi} = (1, 0, 0, 1, 0, 1)$$

$$\mathbf{y} = (2, 3, 6, 7, 12, 1)$$

As shown in Problem 2.21, a segmented prefix computation is a special case of a general prefix computation. This is demonstrated by defining a new associative operation $\otimes$ on value-flag pairs that returns another value-flag pair.
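One possible operator on value-flag pairs is sketched below. It is an assumption for illustration, not necessarily the operator intended in Problem 2.21: when the right pair carries flag 1 the result restarts at its value, and otherwise the values are combined and the left flag is kept. The names `segment_combine` and `segmented_prefix` are ours.

```python
from itertools import accumulate
import operator


def segment_combine(left, right, op=operator.add):
    """Combine two (value, flag) pairs into one (value, flag) pair."""
    (x1, f1), (x2, f2) = left, right
    if f2 == 1:
        return (x2, 1)            # a set flag starts a new segment
    return (op(x1, x2), f1)       # otherwise extend the running segment


def segmented_prefix(values, flags, op=operator.add):
    """Segmented prefix as an ordinary prefix over (value, flag) pairs."""
    pairs = accumulate(zip(values, flags),
                       lambda a, b: segment_combine(a, b, op))
    return [value for value, _ in pairs]


print(segmented_prefix([2, 1, 3, 7, 5, 1], [1, 0, 0, 1, 0, 1]))
# [2, 3, 6, 7, 12, 1]
```

Because `segment_combine` is associative whenever `op` is, the same computation could be fed to any parallel prefix circuit.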

2.6.1  An  Efficient  Parallel  Prefix  Circuit 

A circuit for the prefix function $\mathcal{P}^{(n)}_{\odot}$ can be realized with $O(n^2)$ instances of $\odot$ if for each $1 \le j \le n$ we naively realize $y_j = x_1 \odot x_2 \odot \cdots \odot x_j$ with a separate circuit containing $j - 1$ instances of $\odot$. If each such circuit is organized as a balanced binary tree, the depth of the circuit for $\mathcal{P}^{(n)}_{\odot}$ is the depth of the circuit for $y_n$, which is $\lceil \log_2 n \rceil$. This is a parallel circuit for the prefix problem but uses many more operators than necessary. We now describe a much more efficient circuit for this problem; it uses $O(n)$ instances of $\odot$ and has depth $O(\log n)$.

To describe this improved circuit, we let $x[r, r] = x_r$ and for $r < s$ let $x[r, s] = x_r \odot x_{r+1} \odot \cdots \odot x_s$. Then we can write $\mathcal{P}^{(n)}_{\odot}(\mathbf{x}) = \mathbf{y}$ where $y_j = x[1, j]$.

Because $\odot$ is associative, we observe that $x[r, s] = x[r, t] \odot x[t + 1, s]$ for $r \le t < s$. We use this fact to construct the improved circuit. Let $n = 2^k$. Observe that if we form the $(n/2)$-tuple $(x[1, 2], x[3, 4], x[5, 6], \ldots, x[2^k - 1, 2^k])$ using the rule $x[i, i + 1] = x[i, i] \odot x[i + 1, i + 1]$ for $i$ odd and then do a prefix computation on it, we obtain the $(n/2)$-tuple $(x[1, 2], x[1, 4], x[1, 6], \ldots, x[1, 2^k])$. This is almost what is needed. We must only compute $x[1, 1], x[1, 3], x[1, 5], \ldots, x[1, 2^k - 1]$, which is easily done using the rule $x[1, 2i + 1] = x[1, 2i] \odot x_{2i+1}$ for $1 \le i \le 2^{k-1} - 1$. (See Fig. 2.13.) The base case for this construction is that of $n = 1$, for which $y_1 = x_1$ and no operations are needed.

If $C(k)$ is the size of this circuit on $n = 2^k$ inputs and $D(k)$ is its depth, then $C(0) = 0$, $D(0) = 0$ and $C(k)$ and $D(k)$ for $k \ge 1$ satisfy the following recurrences:

$$C(k) = C(k - 1) + 2^k - 1$$

$$D(k) = D(k - 1) + 2$$

As  a  consequence,  we  have  the  following  result. 

THEOREM 2.6.1 For $n = 2^k$, $k$ an integer, the parallel prefix function $\mathcal{P}^{(n)}_{\odot} : A^n \mapsto A^n$ on an $n$-vector with associative operator $\odot$ can be implemented by a circuit with the following size and depth bounds over the basis $\Omega = \{\odot\}$:

$$C_{\Omega}\bigl(\mathcal{P}^{(n)}_{\odot}\bigr) \le 2n - \log_2 n - 2$$

$$D_{\Omega}\bigl(\mathcal{P}^{(n)}_{\odot}\bigr) \le 2 \log_2 n$$

Proof The solution to the recurrence on $C(k)$ is $C(k) = 2^{k+1} - k - 2$, as the reader can easily show. It satisfies the base case of $k = 0$ and the general case as well. The solution to $D(k)$ is $D(k) = 2k$. ■

When $n$ is not a power of 2, we can start with a circuit for the next higher power of 2 and then delete operations and edges that are not used to produce the first $n$ outputs.

Figure 2.13 A simple recursive construction of a prefix circuit when $n = 2^k = 8$. The gates used at each stage of the construction are grouped into individual shaded regions.
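The recursive construction of this section can be traced with the following Python sketch, which also counts operator applications so the size bound of Theorem 2.6.1 can be checked on small inputs. The function name `parallel_prefix` and the one-element list used as a counter are illustrative choices; the code runs sequentially and only mirrors the structure of the circuit.

```python
def parallel_prefix(xs, op, counter):
    """Prefix of xs (length a power of 2) by the recursive pairing construction.

    counter is a one-element list that accumulates the number of uses of op.
    """
    n = len(xs)
    if n == 1:
        return list(xs)
    # Stage 1: combine adjacent pairs, x[i, i+1] = x_i op x_{i+1} for odd i.
    pairs = []
    for i in range(0, n, 2):
        pairs.append(op(xs[i], xs[i + 1]))
        counter[0] += 1
    # Stage 2: a prefix on the n/2 pair values yields x[1,2], x[1,4], ...
    even = parallel_prefix(pairs, op, counter)
    # Stage 3: fill in x[1,3], x[1,5], ... via x[1,2i+1] = x[1,2i] op x_{2i+1}.
    result = [xs[0]]
    for i in range(1, n // 2):
        result.append(even[i - 1])
        result.append(op(even[i - 1], xs[2 * i]))
        counter[0] += 1
    result.append(even[-1])
    return result


uses = [0]
print(parallel_prefix(list("abcdefgh"), lambda a, b: a + b, uses))
print(uses[0])   # 11, which equals 2n - log2(n) - 2 for n = 8
```

String concatenation is used in the example because it is associative but not commutative, so any mistake in operand order would show up in the output.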


2.7  Addition 

Addition  is  a  central  operation  in  all  general-purpose  digital  computers.  In  this  section  we 
describe  the  standard  ripple  adder  and  the  fast  carry-lookahead  addition  circuits.  The  ripple 
adder  mimics  the  elementary  method  of  addition  taught  to  beginners  but  for  binary  instead  of 
decimal  numbers.  Carry-lookahead  addition  is  a  fast  addition  method  based  on  the  fast  prefix 
circuit  described  in  the  preceding  section. 

Consider the binary representation of integers in the set $\{0, 1, 2, \ldots, 2^n - 1\}$. They are represented by binary $n$-tuples $\mathbf{u} = (u_{n-1}, u_{n-2}, \ldots, u_1, u_0)$ and have value

$$|\mathbf{u}| = \sum_{j=0}^{n-1} u_j 2^j$$

where $\sum$ denotes integer addition.

The addition function $f^{(n)}_{\text{add}} : B^{2n} \mapsto B^{n+1}$ computes the sum of two binary $n$-bit numbers $\mathbf{u}$ and $\mathbf{v}$, as shown below, where $+$ denotes integer addition:

$$|\mathbf{u}| + |\mathbf{v}| = \sum_{j=0}^{n-1} (u_j + v_j) 2^j$$

The tuple $((u_{n-1} + v_{n-1}), (u_{n-2} + v_{n-2}), \ldots, (u_0 + v_0))$ is not a binary number because the coefficients of the powers of 2 are not Boolean. However, if the integer $u_0 + v_0$ is converted to a binary number $(c_1, s_0)$, where $c_1 2^1 + s_0 2^0 = u_0 + v_0$, then the sum can be replaced by

$$|\mathbf{u}| + |\mathbf{v}| = \sum_{j=2}^{n-1} (u_j + v_j) 2^j + (u_1 + v_1 + c_1) 2^1 + s_0 2^0$$

where the least significant bit is now Boolean. In turn, the sum $u_1 + v_1 + c_1$ can be represented in binary by $(c_2, s_1)$, where $c_2 2 + s_1 = u_1 + v_1 + c_1$. The sum $|\mathbf{u}| + |\mathbf{v}|$ can then be replaced by one in which the two least significant coefficients are Boolean. Repeating this process on all coefficients, we have the ripple adder shown in Fig. 2.14.

In the general case, the $j$th stage of a ripple adder combines the $j$th coefficients of each binary number, namely $u_j$ and $v_j$, and the carry from the previous stage, $c_j$, and represents their integer sum with the binary notation $(c_{j+1}, s_j)$, where

$$c_{j+1} 2 + s_j = u_j + v_j + c_j$$

Here $c_{j+1}$, the number of 2's in the sum $u_j + v_j + c_j$, is the carry into the $(j + 1)$st stage and $s_j$, the number of 1's in the sum modulo 2, is the external output from the $j$th stage. The circuit performing this mapping is called a full adder (see Fig. 2.15). As the reader can easily show by constructing a table, this circuit computes the function $f_{\text{FA}} : B^3 \mapsto B^2$, where $f_{\text{FA}}(u_j, v_j, c_j) = (c_{j+1}, s_j)$ is described by the following formulas:

$$\begin{aligned} p_j &= u_j \oplus v_j \\ g_j &= u_j \wedge v_j \\ c_{j+1} &= (p_j \wedge c_j) \vee g_j \\ s_j &= p_j \oplus c_j \end{aligned} \qquad (2.6)$$


Here $p_j$ and $g_j$ are intermediate variables with a special significance. If $g_j = 1$, a carry is generated at the $j$th stage. If $p_j = 1$, a carry from the previous stage is propagated through the $j$th stage, that is, a carry-out occurs exactly when a carry-in occurs. Note that $p_j$ and $g_j$ cannot both have value 1.

The  full  adder  can  be  realized  with  five  gates  and  depth  3.  Since  the  first  full  adder  has 
value  0  for  its  carry  input,  three  gates  can  be  eliminated  from  its  circuit  and  its  depth  reduced 
by  2.  It  follows  that  a  ripple  adder  can  be  realized  by  a  circuit  with  the  following  size  and 
depth. 


Figure 2.14 A ripple adder for binary numbers.

THEOREM 2.7.1 The addition function $f^{(n)}_{\text{add}} : B^{2n} \mapsto B^{n+1}$ can be realized with a ripple adder with the following size and depth bounds over the basis $\Omega = \{\wedge, \vee, \oplus\}$:

$$C_{\Omega}\bigl(f^{(n)}_{\text{add}}\bigr) \le 5n - 3$$

$$D_{\Omega}\bigl(f^{(n)}_{\text{add}}\bigr) \le 3n - 2$$

(Do the ripple adders actually have depth less than $3n - 2$?)
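The full adder formulas (2.6) and the ripple adder built from them can be sketched directly in Python. The helper names (`full_adder`, `ripple_add`, `to_bits`, `from_bits`) are ours, and bits are stored least significant first so that list index $j$ holds the coefficient of $2^j$.

```python
def full_adder(u, v, c):
    """Return (c_next, s) from bits u, v and carry-in c, following (2.6)."""
    p = u ^ v                 # propagate
    g = u & v                 # generate
    c_next = (p & c) | g
    s = p ^ c
    return c_next, s


def ripple_add(u_bits, v_bits):
    """Add two n-bit numbers; returns the n + 1 bits [s_0, ..., s_{n-1}, c_n]."""
    carry = 0
    sums = []
    for u, v in zip(u_bits, v_bits):
        carry, s = full_adder(u, v, carry)   # the carry ripples to the next stage
        sums.append(s)
    return sums + [carry]


def to_bits(value, n):
    return [(value >> j) & 1 for j in range(n)]


def from_bits(bits):
    return sum(bit << j for j, bit in enumerate(bits))


print(from_bits(ripple_add(to_bits(11, 4), to_bits(6, 4))))   # 17
```

Each stage must wait for the carry of the stage before it, which is why the depth of the circuit grows linearly with $n$.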

2.7.1 Carry-Lookahead Addition

The  ripple  adder  is  economical;  it  uses  a  small  number  of  gates.  Unfortunately,  it  is  slow.  The 
depth  of  the  circuit,  a  measure  of  its  speed,  is  linear  in  n,  the  number  of  bits  in  each  integer. 
The  carry-lookahead  adder  described  below  is  considerably  faster.  It  uses  the  parallel  prefix 
circuit  described  in  the  preceding  section. 

The carry-lookahead adder circuit is obtained by applying the prefix operation to pairs in $B^2$ using the associative operator $\diamond : (B^2)^2 \mapsto B^2$ defined below. Let $(a, b)$ and $(c, d)$ be arbitrary pairs in $B^2$. Then $\diamond$ is defined by the following formula:

$$(a, b) \diamond (c, d) = (a \wedge c,\ (b \wedge c) \vee d)$$

To show that $\diamond$ is associative, it suffices to show by straightforward algebraic manipulation that for all values of $a$, $b$, $c$, $d$, $e$, and $f$ the following holds:

$$((a, b) \diamond (c, d)) \diamond (e, f) = (a, b) \diamond ((c, d) \diamond (e, f)) = (a \wedge c \wedge e,\ (b \wedge c \wedge e) \vee (d \wedge e) \vee f)$$

Let $\pi[j, j] = (p_j, g_j)$ and, for $j < k$, let $\pi[j, k] = \pi[j, k - 1] \diamond \pi[k, k]$. By induction it is straightforward to show that the first component of $\pi[j, k]$ is 1 if and only if a carry propagates through the full adder stages numbered $j, j + 1, \ldots, k$ and its second component is 1 if and only if a carry is generated at the $r$th stage, $j \le r \le k$, and propagates from that stage through the $k$th stage. (See Problem 2.26.)

The prefix computation on the string $(\pi[0, 0], \pi[1, 1], \ldots, \pi[n - 1, n - 1])$ with the operator $\diamond$ produces the string $(\pi[0, 0], \pi[0, 1], \pi[0, 2], \ldots, \pi[0, n - 1])$. The first component of $\pi[0, j]$ is 1 if and only if a carry generated at the zeroth stage, $c_0$, is propagated through the $j$th stage. Since $c_0 = 0$, this component is not used. The second component of $\pi[0, j]$, $c_{j+1}$, is 1 if and only if a carry is generated at or before the $j$th stage. From (2.6) we see that the sum bit generated at the $j$th stage, $s_j$, satisfies $s_j = p_j \oplus c_j$. Thus the $j$th output bit, $s_j$, is obtained from the EXCLUSIVE OR of $p_j$ and the second component of $\pi[0, j - 1]$.

THEOREM 2.7.2 For $n = 2^k$, $k$ an integer, the addition function $f^{(n)}_{\text{add}} : B^{2n} \mapsto B^{n+1}$ can be realized with a carry-lookahead adder with the following size and depth bounds over the basis $\Omega = \{\wedge, \vee, \oplus\}$:

$$C_{\Omega}\bigl(f^{(n)}_{\text{add}}\bigr) \le 8n$$

$$D_{\Omega}\bigl(f^{(n)}_{\text{add}}\bigr) \le 4 \log_2 n + 2$$

Proof By Theorem 2.6.1 the prefix circuit uses at most $2n - \log_2 n - 2$ instances of $\diamond$ and has depth $2 \log_2 n$. Since each instance of $\diamond$ can be realized by a circuit of size 3 and depth 2, each of these bounds is multiplied by these factors. Since the first component of $\pi[0, j]$ is not used, the propagate value computed at each output combiner vertex can be eliminated. This saves one gate per result bit, or $n$ gates. However, for each $0 \le j \le n - 1$ we need two gates to compute $p_j$ and $g_j$ and one gate to compute $s_j$, $3n$ additional gates. The computation of these three sets of functions adds depth 2 to that of the prefix circuit. This gives the desired bounds. ■

The addition function $f^{(n)}_{\text{add}}$ is computed by the carry-lookahead adder circuit with about 1.6 times as many gates as the ripple adder but in logarithmic instead of linear depth.
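The carry-lookahead computation can be sketched as a prefix over (propagate, generate) pairs. In this illustration the prefix is evaluated sequentially with `itertools.accumulate`; a circuit would replace that call with the parallel prefix network of Section 2.6.1. The names `diamond` and `carry_lookahead_add` are ours, and bits are stored least significant first.

```python
from itertools import accumulate


def diamond(left, right):
    """(a, b) <> (c, d) = (a AND c, (b AND c) OR d)."""
    (a, b), (c, d) = left, right
    return (a & c, (b & c) | d)


def carry_lookahead_add(u_bits, v_bits):
    """Add two n-bit numbers; returns the n + 1 bits [s_0, ..., s_{n-1}, c_n]."""
    p = [u ^ v for u, v in zip(u_bits, v_bits)]        # propagate bits p_j
    g = [u & v for u, v in zip(u_bits, v_bits)]        # generate bits g_j
    prefixes = list(accumulate(zip(p, g), diamond))    # pi[0, j] for each j
    # The second component of pi[0, j] is the carry c_{j+1}; c_0 = 0.
    carries = [0] + [second for _, second in prefixes]
    sums = [p_j ^ c_j for p_j, c_j in zip(p, carries)]
    return sums + [carries[-1]]


# 11 + 6 = 17 with 4-bit operands, bits listed least significant first.
print(carry_lookahead_add([1, 1, 0, 1], [0, 1, 1, 0]))   # [1, 0, 0, 0, 1]
```

Once all carries are available, every sum bit is a single EXCLUSIVE OR, so the depth is dominated by the prefix computation.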

When  exact  addition  is  expected  and  every  number  is  represented  by  n  bits,  a  carry-out  of 
the  last  stage  of  an  adder  constitutes  an  overflow,  an  error. 


2.8  Subtraction 

Subtraction is possible when negative numbers are available. There are several ways to represent negative numbers. To demonstrate that subtraction is not much harder than addition, we consider the signed two's complement representation for positive and negative integers in the set $\mathbb{Z}(n) = \{-2^n, \ldots, -2, -1, 0, 1, 2, \ldots, 2^n - 1\}$. Each signed number $u$ is represented by an $(n + 1)$-tuple $(\sigma, \mathbf{u})$, where $\sigma$ is its sign and $\mathbf{u} = (u_{n-1}, \ldots, u_0)$ is a binary number that is either the magnitude $|u|$ of the number $u$, if positive, or the two's complement $2^n - |u|$ of it, if negative. The sign $\sigma$ is defined below:

$$\sigma = \begin{cases} 0 & \text{the number } u \text{ is positive or zero} \\ 1 & \text{the number } u \text{ is negative} \end{cases}$$


The two's complement of an $n$-bit binary number $\mathbf{v}$ is easily formed by adding 1 to $t = 2^n - 1 - |\mathbf{v}|$. Since $2^n - 1$ is represented as the $n$-tuple of 1's, $t$ is obtained by complementing (NOTing) every bit of $\mathbf{v}$. Thus, the two's complement of $\mathbf{u}$ is obtained by complementing every bit of $\mathbf{u}$ and then adding 1. It follows that the two's complement of the two's complement of a number is the number itself. Thus, the magnitude of a negative number $(1, \mathbf{u})$ is the two's complement of $\mathbf{u}$.

This is illustrated by the integers in the set $\mathbb{Z}(4) = \{-16, \ldots, -2, -1, 0, 1, 2, \ldots, 15\}$. The two's complement representations of the decimal integers 9 and $-11$ are

$$9 = (0, 1, 0, 0, 1)$$

$$-11 = (1, 0, 1, 0, 1)$$

Note that the two's complement of 11 is $16 - 11 = 5$, which is represented by the four-tuple $(0, 1, 0, 1)$. The value of the two's complement of 11 can be computed by complementing all bits in its binary representation $(1, 0, 1, 1)$ and adding 1.

To add two numbers $u$ and $v$ in two's complement notation $(\sigma_u, \mathbf{u})$ and $(\sigma_v, \mathbf{v})$, we add them as binary $(n + 1)$-tuples and discard the overflow bit, that is, the coefficient of $2^{n+1}$. We now show that this procedure provides a correct answer when no overflow occurs and establish the conditions under which overflow does occur.

Let $|u|$ and $|v|$ denote the magnitudes of the two numbers. There are four cases for their sum $u + v$:

Case    u         v         u + v
I       $\ge 0$   $\ge 0$   $|u| + |v|$
II      $\ge 0$   $< 0$     $2^{n+1} + |u| - |v|$
III     $< 0$     $\ge 0$   $2^{n+1} - |u| + |v|$
IV      $< 0$     $< 0$     $2^{n+1} + 2^{n+1} - |u| - |v|$

In the first case the sum is positive. If the coefficient of $2^n$ is 1, an overflow error is detected. In the second case, if $|u| - |v|$ is negative, then $2^{n+1} + |u| - |v| = 2^n + 2^n - \bigl||u| - |v|\bigr|$ and the result is in two's complement notation with sign 1, as it should be. If $|u| - |v|$ is positive, the coefficient of $2^n$ is 0 (a carry-out of the last stage has occurred) and the result is a positive number with sign bit 0, properly represented. A similar statement applies to the third case. In the fourth case, if $|u| + |v|$ is less than $2^n$, the sum is $2^{n+1} + 2^n + (2^n - (|u| + |v|))$, which is $2^n + (2^n - (|u| + |v|))$ when the coefficient of $2^{n+1}$ is discarded. This is a proper representation for a negative number. However, if $|u| + |v| > 2^n$, a borrow occurs from the $(n + 1)$st position and the sum $2^{n+1} + 2^n + (2^n - (|u| + |v|))$ has a 0 in the $(n + 1)$st position, which is not a proper representation for a negative number (after discarding $2^{n+1}$); overflow has occurred.

The  following  procedure  can  be  used  to  subtract  integer  u  from  integer  v:  form  the  two’s 
complement  of  u  and  add  it  to  the  representation  for  v.  The  negation  of  a  number  is  obtained 
by  complementing  its  sign  and  taking  the  two’s  complement  of  its  binary  n-tuple.  It  follows 
that  subtraction  can  be  done  with  a  circuit  of  size  linear  in  n  and  depth  logarithmic  in  n.  (See 
Problem  2.27.) 
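The following Python sketch illustrates two's complement addition and subtraction on $(n + 1)$-bit patterns. It makes several simplifying assumptions: patterns are held as Python integers with the sign as the top bit, negation is done by complementing all $n + 1$ bits and adding 1, and overflow is detected by comparison with the exact integer sum instead of by inspecting sign bits. All function names are ours.

```python
def encode(value, n):
    """(n+1)-bit two's complement pattern of value, as an integer."""
    assert -(2 ** n) <= value <= 2 ** n - 1
    return value % (2 ** (n + 1))


def decode(pattern, n):
    """Signed value of an (n+1)-bit pattern whose top bit is the sign."""
    sign = pattern >> n
    low = pattern & (2 ** n - 1)
    return low - sign * 2 ** n


def negate(pattern, n):
    """Complement every bit and add 1, keeping n + 1 bits."""
    mask = 2 ** (n + 1) - 1
    return ((pattern ^ mask) + 1) & mask


def add(u, v, n):
    """Add as (n+1)-bit numbers, discarding the coefficient of 2**(n+1)."""
    total = (encode(u, n) + encode(v, n)) % (2 ** (n + 1))
    result = decode(total, n)
    return result, result != u + v     # second value flags overflow


def subtract(v, u, n):
    """Compute v - u by adding the negation of u to v."""
    mask = 2 ** (n + 1) - 1
    total = (encode(v, n) + negate(encode(u, n), n)) & mask
    return decode(total, n)


print(format(encode(9, 4), "05b"), format(encode(-11, 4), "05b"))  # 01001 10101
print(add(9, -11, 4))      # (-2, False)
print(subtract(9, 11, 4))  # -2
```

The two printed patterns match the representations of 9 and $-11$ given earlier for $n = 4$.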

2.9  Multiplication 

In this section we examine several methods of multiplying integers. We begin with the standard elementary integer multiplication method based on the binary representation of numbers. This method requires $O(n^2)$ gates and has depth $O(\log^2 n)$ on $n$-bit numbers. We then examine a divide-and-conquer method that has the same depth but much smaller circuit size. We also describe fast multiplication methods, that is, methods that have circuits with smaller depths. These include a circuit whose depth is much smaller than $O(\log n)$. It uses a novel representation of numbers, namely, the exponents of numbers in their prime number decomposition.

The integer multiplication function $f^{(n)}_{\text{mult}} : B^{2n} \mapsto B^{2n}$ can be realized by the standard integer multiplication algorithm, which is based on the following representation for the product of integers represented as binary $n$-tuples $\mathbf{u}$ and $\mathbf{v}$:

$$|\mathbf{u}|\,|\mathbf{v}| = \sum_{i=0}^{n-1} \sum_{j=0}^{n-1} u_i v_j 2^{i+j} \qquad (2.7)$$

Here $|\mathbf{u}|$ and $|\mathbf{v}|$ are the magnitudes of the integers represented by $\mathbf{u}$ and $\mathbf{v}$. The standard algorithm forms the products $u_i v_j$ individually to create $n$ binary numbers, as suggested below. Here each row corresponds to a different number; the columns correspond to powers of 2 with the rightmost column corresponding to the least significant component, namely the coefficient of $2^0$.


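$$\begin{array}{ccccccc|c}
2^6 & 2^5 & 2^4 & 2^3 & 2^2 & 2^1 & 2^0 & \\ \hline
 & & & u_0 v_3 & u_0 v_2 & u_0 v_1 & u_0 v_0 & = z_0 \\
 & & u_1 v_3 & u_1 v_2 & u_1 v_1 & u_1 v_0 & 0 & = z_1 \\
 & u_2 v_3 & u_2 v_2 & u_2 v_1 & u_2 v_0 & 0 & 0 & = z_2 \\
u_3 v_3 & u_3 v_2 & u_3 v_1 & u_3 v_0 & 0 & 0 & 0 & = z_3
\end{array} \qquad (2.8)$$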

Let the $i$th binary number produced by this multiplication operation be $z_i$. Since each of these $n$ binary numbers contains at most $2n - 1$ bits, we treat them as if they were $(2n - 1)$-bit numbers. If these numbers are added in the order shown in Fig. 2.16(a) using a carry-lookahead adder at each step, the time to perform the additions, measured by the depth of a circuit, is $O(n \log n)$. The size of this circuit is $O(n^2)$. A faster circuit containing about the same number of gates can be constructed by adding $z_0, \ldots, z_{n-1}$ in a balanced binary tree with $n$ leaves, as shown in Fig. 2.16(b). This tree has $n - 1$ $(2n - 1)$-bit adders. (A binary tree with $n$ leaves has $n - 1$ internal vertices.) If each of the adders is a carry-lookahead adder, the depth of this circuit is $O(\log^2 n)$ because the tree has $O(\log n)$ adders on every path from the root to a leaf.


Figure 2.16 Two methods for aggregating the binary numbers $z_0, \ldots, z_{n-1}$.

2.9.1 Carry-Save Multiplication

We now describe a much faster circuit obtained through the use of the carry-save adder. Let $\mathbf{u}$, $\mathbf{v}$, and $\mathbf{w}$ be three binary $n$-bit numbers. Their sum is a binary number $\mathbf{t}$. It follows that $|\mathbf{t}|$ can be represented as

$$|\mathbf{t}| = |\mathbf{u}| + |\mathbf{v}| + |\mathbf{w}| = \sum_{i=0}^{n-1} (u_i + v_i + w_i) 2^i$$

With a full adder the sum $(u_i + v_i + w_i)$ can be converted to the binary representation $c_{i+1} 2 + s_i$. Making this substitution, we have the following expression for the sum:

$$|\mathbf{t}| = |\mathbf{u}| + |\mathbf{v}| + |\mathbf{w}| = \sum_{i=0}^{n-1} (2 c_{i+1} + s_i) 2^i = |\mathbf{c}| + |\mathbf{s}|$$

Here $\mathbf{c}$ with $c_0 = 0$ is an $(n + 1)$-tuple and $\mathbf{s}$ is an $n$-tuple. The conversion of $(u_i, v_i, w_i)$ to $(c_{i+1}, s_i)$ can be done with the full adder circuit shown in Fig. 2.15 of size 5 and depth 3 over the basis $\Omega = \{\wedge, \vee, \oplus\}$.

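The function $f^{(n)}_{\text{carry-save}} : B^{3n} \mapsto B^{2n+1}$ that maps three binary $n$-tuples, $\mathbf{u}$, $\mathbf{v}$, and $\mathbf{w}$, to the pair $(\mathbf{c}, \mathbf{s})$ described above is the carry-save adder. A circuit of full adders that realizes this function is shown in Fig. 2.17.


THEOREM 2.9.1 The carry-save adder function $f^{(n)}_{\text{carry-save}} : B^{3n} \mapsto B^{2n+2}$ can be realized with the following size and depth over the basis $\Omega = \{\wedge, \vee, \oplus\}$:

$$C_{\Omega}\bigl(f^{(n)}_{\text{carry-save}}\bigr) \le 5n$$

$$D_{\Omega}\bigl(f^{(n)}_{\text{carry-save}}\bigr) \le 3$$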


Three  binary  n-bit  numbers  u,  v,  w  can  be  added  by  combining  them  in  a  carry-save 
adder  to  produce  the  pair  (c,  s),  which  are  then  added  in  an  (n  +  l)-input  binary  adder.  Any 
adder  can  be  used  for  this  purpose. 
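A carry-save adder is just $n$ independent full adders, one per bit position, as the sketch below shows. The function name `carry_save` is ours, and numbers are held as Python integers whose bit $i$ is the coefficient of $2^i$.

```python
def carry_save(u, v, w, n):
    """Reduce three n-bit integers to a pair (c, s) with u + v + w == c + s."""
    c, s = 0, 0
    for i in range(n):
        a, b, d = (u >> i) & 1, (v >> i) & 1, (w >> i) & 1
        p = a ^ b
        s |= (p ^ d) << i                       # sum bit s_i
        c |= ((p & d) | (a & b)) << (i + 1)     # carry bit c_{i+1}, weight 2**(i+1)
    return c, s


c, s = carry_save(5, 7, 6, 3)
print(c, s, c + s)   # 14 4 18
```

No carry travels from one position to the next inside the loop, which is why the depth of the carry-save adder does not depend on $n$.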


Figure 2.17 A carry-save adder realized by an array of full adders.

A multiplier for two $n$-bit binary numbers can be formed by first creating the $n$ $(2n - 1)$-bit binary numbers shown in (2.8) and then adding them, as explained above. These $n$ numbers can be added in groups of three, as suggested in Fig. 2.18.

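Let's now count the number of levels of carry-save adders in this construction. At the zeroth level there are $m_0 = n$ numbers. At the $j$th level there are

$$m_j = 2\lfloor m_{j-1}/3 \rfloor + m_{j-1} - 3\lfloor m_{j-1}/3 \rfloor = m_{j-1} - \lfloor m_{j-1}/3 \rfloor$$

binary numbers. This follows because there are $\lfloor m_{j-1}/3 \rfloor$ groups of three binary numbers and each group is mapped to two binary numbers. Not combined into such groups are $m_{j-1} - 3\lfloor m_{j-1}/3 \rfloor$ binary numbers, giving the total $m_j$. Since $(x - 2)/3 \le \lfloor x/3 \rfloor \le x/3$, we have

$$\frac{2}{3}\, m_{j-1} \le m_j \le \frac{2}{3}\, m_{j-1} + \frac{2}{3}$$

from which it is easy to show by induction that the following inequality holds:

$$\left(\frac{2}{3}\right)^j n \le m_j \le \left(\frac{2}{3}\right)^j n + 2\left(1 - \left(\frac{2}{3}\right)^j\right) \le \left(\frac{2}{3}\right)^j n + 2$$

Let $s$ be the number of stages after which $m_s = 2$. Since $m_{s-1} \ge 3$, we have

$$\frac{\log_2(n/2)}{\log_2(3/2)} \le s \le \frac{\log_2 n}{\log_2(3/2)} + 1$$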
The  number  of  carry-save  adders  used  in  this  construction  is  n  —  2.  This  follows  from  the 
observation  that  the  number  of  carry-save  adders  used  in  one  stage  is  equal  to  the  decrease  in 
the  number  of  binary  numbers  from  one  stage  to  the  next.  Since  we  start  with  n  and  finish 
with  2,  the  result  follows. 

After reducing the $n$ binary numbers to two binary numbers through a series of carry-save adder stages, the two remaining binary numbers are added in a traditional binary adder. Since each carry-save adder operates on three $(2n - 1)$-bit binary numbers, they use at most $5(2n - 1)$ gates and have depth 3. Summarizing, we have the following theorem showing that carry-save addition provides a multiplication circuit of depth $O(\log n)$ but of size quadratic in $n$.


Figure 2.18 Schema for the carry-save combination of nine 18-bit numbers.

THEOREM 2.9.2 The binary multiplication function $f^{(n)}_{\text{mult}} : B^{2n} \mapsto B^{2n}$ for $n$-bit binary numbers can be realized by carry-save addition by a circuit of the following size and depth over the basis $\Omega = \{\wedge, \vee, \oplus\}$:

$$C_{\Omega}\bigl(f^{(n)}_{\text{mult}}\bigr) \le 5(2n - 1)(n - 2) + C_{\Omega}\bigl(f^{(2n)}_{\text{add}}\bigr)$$

$$D_{\Omega}\bigl(f^{(n)}_{\text{mult}}\bigr) \le 3s + D_{\Omega}\bigl(f^{(2n)}_{\text{add}}\bigr)$$

where $s$, the number of carry-save adder stages, satisfies

$$s \le \frac{\log_2 n}{\log_2(3/2)} + 1$$

It follows from this theorem and the results of Theorem 2.7.2 that two $n$-bit binary numbers can be multiplied by a circuit of size $O(n^2)$ and depth $O(\log n)$.
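The stage and adder counts for carry-save multiplication can be checked with a short simulation. This sketch assumes $n \ge 2$, applies the carry-save step bitwise to whole Python integers, and uses ordinary integer addition in place of the final binary adder. The names `carry_save_reduce`, `multiply`, and `stage_bounds` are ours.

```python
import math


def carry_save_reduce(numbers):
    """Reduce nonnegative integers to two numbers with the same sum.

    Returns (remaining numbers, number of stages, number of carry-save adders).
    """
    stages = adders = 0
    while len(numbers) > 2:
        groups = len(numbers) // 3
        reduced = []
        for k in range(groups):
            u, v, w = numbers[3 * k:3 * k + 3]
            s = u ^ v ^ w                               # sum bits s_i
            c = ((u & v) | (u & w) | (v & w)) << 1      # carry bits c_{i+1}
            reduced += [c, s]
            adders += 1
        reduced += numbers[3 * groups:]                 # numbers left ungrouped
        numbers = reduced
        stages += 1
    return numbers, stages, adders


def multiply(u, v, n):
    """Multiply n-bit integers (n >= 2) from the n partial products z_i."""
    partial = [((u >> i) & 1) * (v << i) for i in range(n)]
    (a, b), stages, adders = carry_save_reduce(partial)
    return a + b, stages, adders    # a + b stands in for the final binary adder


def stage_bounds(n):
    """Lower and upper bounds on the number of stages s."""
    ratio = math.log2(3 / 2)
    return math.log2(n / 2) / ratio, math.log2(n) / ratio + 1


print(multiply(13, 11, 4))   # (143, 2, 2)
print(stage_bounds(4))
```

For every $n$ tried, the simulation uses exactly $n - 2$ carry-save adders, in agreement with the counting argument above.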


2.9.2  Divide-and-Conquer  Multiplication 

We now examine a multiplier of much smaller circuit size but depth $O(\log^2 n)$. It uses a divide-and-conquer technique. We represent two positive integers by their n-bit binary numbers $\mathbf{u}$ and $\mathbf{v}$. We assume that n is even and decompose each number into two (n/2)-bit numbers:

$$\mathbf{u} = (\mathbf{u}_h, \mathbf{u}_l), \quad \mathbf{v} = (\mathbf{v}_h, \mathbf{v}_l)$$

where $\mathbf{u}_h, \mathbf{u}_l, \mathbf{v}_h, \mathbf{v}_l$ are the high and low components of the vectors $\mathbf{u}$ and $\mathbf{v}$, respectively. Then we can write

$$|\mathbf{u}| = |\mathbf{u}_h| 2^{n/2} + |\mathbf{u}_l|$$

$$|\mathbf{v}| = |\mathbf{v}_h| 2^{n/2} + |\mathbf{v}_l|$$

from which we have

$$|\mathbf{u}||\mathbf{v}| = |\mathbf{u}_l||\mathbf{v}_l| + \left(|\mathbf{u}_h||\mathbf{v}_h| + (|\mathbf{v}_h| - |\mathbf{v}_l|)(|\mathbf{u}_l| - |\mathbf{u}_h|) + |\mathbf{u}_l||\mathbf{v}_l|\right) 2^{n/2} + |\mathbf{u}_h||\mathbf{v}_h| 2^n$$

It follows from this expression that only three integer multiplications are needed, namely $|\mathbf{u}_l||\mathbf{v}_l|$, $|\mathbf{u}_h||\mathbf{v}_h|$, and $(|\mathbf{v}_h| - |\mathbf{v}_l|)(|\mathbf{u}_l| - |\mathbf{u}_h|)$; multiplication by a power of 2 is done by realigning bits for addition. Each multiplication is of (n/2)-bit numbers. Six additions and subtractions of 2n-bit numbers suffice to complete the computation. Each of the additions and subtractions can be done with a linear number of gates in logarithmic time.
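
The three-multiplication identity can be exercised directly. The sketch below is an illustrative Python rendering, not a circuit: it assumes n is a power of 2, handles the possibly negative differences by multiplying magnitudes and tracking the sign separately (one possible convention, chosen here for simplicity), and counts the 1-bit multiplications (AND operations) reached at the bottom of the recursion. The name `dc_multiply` is hypothetical.

```python
def dc_multiply(u, v, n):
    """Multiply two n-bit non-negative integers (n a power of 2).

    Returns (product, number of 1-bit multiplications used).
    """
    if n == 1:
        return u & v, 1                      # one AND gate suffices
    half = n // 2
    mask = (1 << half) - 1
    u_h, u_l = u >> half, u & mask           # high and low halves of u
    v_h, v_l = v >> half, v & mask           # high and low halves of v

    low, c1 = dc_multiply(u_l, v_l, half)    # |u_l||v_l|
    high, c2 = dc_multiply(u_h, v_h, half)   # |u_h||v_h|

    # (|v_h| - |v_l|)(|u_l| - |u_h|): multiply magnitudes, restore the sign.
    d_v, d_u = v_h - v_l, u_l - u_h
    cross, c3 = dc_multiply(abs(d_v), abs(d_u), half)
    if (d_v < 0) != (d_u < 0):
        cross = -cross

    middle = high + cross + low              # equals |u_h||v_l| + |u_l||v_h|
    return low + (middle << half) + (high << n), c1 + c2 + c3
```

With n = 4 the recursion performs $3^{\log_2 4} = 9$ one-bit multiplications, so `dc_multiply(13, 11, 4)` returns `(143, 9)`.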

If  C(n)  and  D(n)  are  the  size  and  depth  of  a  circuit  for  integer  multiplication  realized 
with  this  divide-and-conquer  method,  then  we  have 

$$C(n) \le 3C(n/2) + cn \qquad (2.9)$$

$$D(n) \le D(n/2) + d \log_2 n \qquad (2.10)$$

where c and d are constants of the construction. Since $C(1) = 1$ and $D(1) = 1$ (one use of AND suffices), we have the following theorem, the proof of which is left as an exercise (see Problem 2.28).

THEOREM 2.9.3 If $n = 2^k$, the binary multiplication function $f^{(n)}_{\mathrm{mult}} : \mathcal{B}^{2n} \mapsto \mathcal{B}^{2n}$ for n-bit binary numbers can be realized by a circuit for the divide-and-conquer algorithm of the following size and depth over the basis $\Omega = \{\wedge, \vee, \oplus\}$:

$$C_\Omega\left(f^{(n)}_{\mathrm{mult}}\right) = O\left(3^{\log_2 n}\right) = O\left(n^{\log_2 3}\right)$$

$$D_\Omega\left(f^{(n)}_{\mathrm{mult}}\right) = O(\log^2 n)$$

The size of this divide-and-conquer multiplication circuit is $O(n^{1.585})$, which is much smaller than the $O(n^2)$ bound based on carry-save addition. The depth bound can be reduced to $O(\log n)$ through the use of carry-save addition. (See Problem 2.29.) However, even faster multiplication algorithms are known for large n.

2.9.3  Fast  Multiplication 

Schönhage and Strassen [303] have described a circuit to multiply integers represented in binary that is asymptotically small and shallow. Their algorithm for the multiplication of n-bit binary numbers uses $O(n \log n \log\log n)$ gates and depth $O(\log n)$. It illustrates the point that a circuit can be devised for this problem that has depth $O(\log n)$ and uses a number of gates considerably less than quadratic in n. Although the coefficients on the size and depth bounds are so large that their circuit is not practical, their result is interesting and motivates the following definition.

DEFINITION 2.9.1 $M_{\mathrm{int}}(n, c)$ is the size of the smallest circuit for the multiplication of two n-bit binary numbers that has depth at most $c \log_2 n$ for $c > 0$.

The Schönhage-Strassen circuit demonstrates that $M_{\mathrm{int}}(n, c) = O(n \log n \log\log n)$ for all $n > 1$. It is also clear that $M_{\mathrm{int}}(n, c) = \Omega(n)$ because any multiplication circuit must examine each component of each binary number and no more than a constant number of inputs can be combined by one gate. (Chapter 9 provides methods for deriving lower bounds on the size and depth of circuits.)

Because we use integer multiplication in other circuits, it is convenient to make the following reasonable assumption about the dependence of $M_{\mathrm{int}}(n, c)$ on n. We assume that

$$M_{\mathrm{int}}(dn, c) \le d\,M_{\mathrm{int}}(n, c)$$

for all d satisfying $0 \le d \le 1$. This condition is satisfied by the Schönhage-Strassen circuit.

2.9.4  Very  Fast  Multiplication 

If integers in the set $\{0, 1, \ldots, N-1\}$ are represented by the exponents of primes in their prime factorization, they can be multiplied by adding exponents. The largest exponent on a prime in this range is at most $\log_2 N$. Thus, exponents can be represented by $O(\log\log N)$ bits and integers multiplied by circuits with depth $O(\log\log\log N)$. (See Problem 2.32.) This depth is much smaller than $O(\log\log N)$, the depth of circuits to add integers in any fixed radix system. (Note that if $N = 2^n$, $\log_2\log_2 N = \log_2 n$.) However, addition is very difficult in this number system. Thus, it is a fast number system only if the operations are limited to multiplications.
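
A small Python sketch can make the exponent representation concrete. It is only an illustration under stated assumptions: integers are positive (zero has no factorization and would need separate treatment), the prime list is fixed in advance, and the helper names `primes_up_to`, `to_exponents`, `multiply_exponents`, and `from_exponents` are hypothetical.

```python
def primes_up_to(limit):
    """Sieve of Eratosthenes: all primes p with 2 <= p <= limit (limit >= 2)."""
    sieve = [True] * (limit + 1)
    sieve[0] = sieve[1] = False
    for p in range(2, int(limit ** 0.5) + 1):
        if sieve[p]:
            for q in range(p * p, limit + 1, p):
                sieve[q] = False
    return [p for p, is_prime in enumerate(sieve) if is_prime]


def to_exponents(x, primes):
    """Represent a positive integer by its exponent on each prime in the list."""
    if x < 1:
        raise ValueError("only positive integers have a prime factorization")
    exponents = []
    for p in primes:
        e = 0
        while x % p == 0:
            x //= p
            e += 1
        exponents.append(e)
    if x != 1:
        raise ValueError("x has a prime factor outside the given list")
    return exponents


def multiply_exponents(a, b):
    """Multiplication is component-wise addition of small exponents."""
    return [i + j for i, j in zip(a, b)]


def from_exponents(exponents, primes):
    """Convert back to an ordinary integer."""
    value = 1
    for p, e in zip(primes, exponents):
        value *= p ** e
    return value
```

The additions in `multiply_exponents` are independent of one another and involve only short numbers, which is the source of the small depth. Converting back with `from_exponents`, or adding two integers in this representation, is where the cost reappears.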
2.9.5 Reductions to Multiplication

The logical shifting function $f^{(n)}_{\mathrm{shift}}$ can be reduced to the integer multiplication function $f^{(n)}_{\mathrm{mult}}$, as can be seen by letting one of the two n-tuple arguments be a power of 2. That is,

$$f^{(n)}_{\mathrm{shift}}(\mathbf{x}, \mathbf{s}) = \pi^{(n)}_L\left(f^{(n)}_{\mathrm{mult}}(\mathbf{x}, \mathbf{y})\right)$$

where $\mathbf{y} = f^{(m)}_{\mathrm{decode}}(\mathbf{s})$ is the value of the decoder function (see Section 2.5) that maps a binary m-tuple, $m = \lceil \log_2 n \rceil$, into a binary $2^m$-tuple containing a single 1 at the output indexed by the integer represented by $\mathbf{s}$ and $\pi^{(n)}_L$ is the projection operator defined on page 50.


LEMMA 2.9.1 The logical shifting function $f^{(n)}_{\mathrm{shift}}$ can be reduced to the binary integer multiplication function $f^{(n)}_{\mathrm{mult}}$ through the application of the decoder function $f^{(m)}_{\mathrm{decode}}$ on $m = \lceil \log_2 n \rceil$ inputs.

As shown in Section 2.5, the decoder function $f^{(m)}_{\mathrm{decode}}$ can be realized with a circuit of size very close to $2^m$ and depth $\lceil \log_2 m \rceil$. Thus, the shifting function has circuit size and depth no more than constant factors larger than those for integer multiplication.
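
The reduction can be sketched in a few lines of Python. This is an illustration only: it assumes the projection keeps the n low-order bits of the product (the text defers the exact definition of the projection operator to an earlier page), and the names `decode` and `shift_via_multiply` are hypothetical.

```python
def decode(s, m):
    """Decoder output as an integer: a 2^m-bit word with a single 1 at index s."""
    if not 0 <= s < (1 << m):
        raise ValueError("s must be an m-bit number")
    return 1 << s


def shift_via_multiply(x, s, n):
    """Shift the n-bit number x left by s places using one multiplication."""
    m = max(1, (n - 1).bit_length())   # m = ceil(log2 n), at least 1
    y = decode(s, m)                   # a power of 2
    product = x * y                    # stands in for the multiplication circuit
    return product & ((1 << n) - 1)    # assumed projection: keep n low-order bits
```

For instance, `shift_via_multiply(0b1011, 2, 4)` gives `0b1100`: the two high-order bits fall off and zeros enter on the right.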

The squaring function $f^{(n)}_{\mathrm{square}} : \mathcal{B}^n \mapsto \mathcal{B}^{2n}$ maps the binary n-tuple $\mathbf{x}$ into the binary 2n-tuple $\mathbf{y}$ representing the product of $\mathbf{x}$ with itself. Since the squaring and integer multiplication functions contain each other as subfunctions, as shown below, circuits for one can be used for the other.

LEMMA 2.9.2 The integer multiplication function $f^{(n)}_{\mathrm{mult}}$ contains the squaring function $f^{(n)}_{\mathrm{square}}$ as a subfunction and $f^{(3n+1)}_{\mathrm{square}}$ contains $f^{(n)}_{\mathrm{mult}}$ as a subfunction.

Proof The first statement follows by setting the two n-tuple inputs of $f^{(n)}_{\mathrm{mult}}$ to be the input to $f^{(n)}_{\mathrm{square}}$. The second statement follows by examining the value of $f^{(3n+1)}_{\mathrm{square}}$ on the (3n+1)-tuple input $(\mathbf{x}\mathbf{z}\mathbf{y})$, where $\mathbf{x}$ and $\mathbf{y}$ are binary n-tuples and $\mathbf{z}$ is the zero binary (n+1)-tuple. Thus, $(\mathbf{x}\mathbf{z}\mathbf{y})$ denotes the value $a = 2^{2n+1}|\mathbf{x}| + |\mathbf{y}|$ whose square b is

$$b = 2^{4n+2}|\mathbf{x}|^2 + 2^{2n+2}|\mathbf{x}||\mathbf{y}| + |\mathbf{y}|^2$$

The value of the product $|\mathbf{x}||\mathbf{y}|$ can be read from the output because there is no carry into $2^{2n+2}|\mathbf{x}||\mathbf{y}|$ from $|\mathbf{y}|^2$, nor is there a carry into $2^{4n+2}|\mathbf{x}|^2$ from $2^{2n+2}|\mathbf{x}||\mathbf{y}|$, since $|\mathbf{x}|, |\mathbf{y}| \le 2^n - 1$. ■
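
The second half of the lemma can be checked numerically. The following Python sketch packs x and y into one (3n+1)-bit number, squares it, and reads the product out of the middle of the result; Python's integer multiplication stands in for a squaring circuit, and the function name `multiply_via_squaring` is illustrative.

```python
def multiply_via_squaring(x, y, n):
    """Recover x*y for n-bit x and y from the square of the (3n+1)-bit word (x z y)."""
    a = (x << (2 * n + 1)) | y        # x, then n+1 zero bits, then y
    b = a * a                         # stands in for the squaring circuit
    # b = 2^(4n+2) x^2 + 2^(2n+2) x y + y^2; the three terms do not overlap,
    # so x*y occupies the 2n bits starting at position 2n+2.
    return (b >> (2 * n + 2)) & ((1 << (2 * n)) - 1)
```

The n+1 zero bits are what keep the three terms apart: $|\mathbf{y}|^2$ fits below position 2n, and $|\mathbf{x}||\mathbf{y}|$ fits below position 4n+2 after its shift.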


2.10  Reciprocal  and  Division 

In  this  section  we  examine  methods  to  divide  integers  represented  in  binary.  Since  the  division 
of  one  integer  by  another  generally  cannot  be  represented  with  a  finite  number  of  bits  (consider, 
for  example,  the  value  of  2/3),  we  must  be  prepared  to  truncate  the  result  of  a  division.  The 
division  method  presented  in  this  section  is  based  on  Newton’s  method  for  finding  a  zero  of  a 
function. 

Let $\mathbf{u} = (u_{n-1}, \ldots, u_1, u_0)$ and $\mathbf{v} = (v_{n-1}, \ldots, v_1, v_0)$ denote integers whose magnitudes are u and v. Then the division of one integer u by another v, $u/v$, can be obtained as the result of taking the product of u with the reciprocal $1/v$. (See Problem 2.33.) For this reason, we examine only the computation of reciprocals of n-bit binary numbers. For simplicity we assume that n is a power of 2.

The reciprocal of the n-bit binary number $\mathbf{u} = (u_{n-1}, \ldots, u_1, u_0)$ representing the integer u is a fractional number r represented by the (possibly infinite) binary number $\mathbf{r} = (r_{-1}, r_{-2}, r_{-3}, \ldots)$, where

$$|\mathbf{r}| = r_{-1}2^{-1} + r_{-2}2^{-2} + r_{-3}2^{-3} + \cdots$$

Some numbers, such as 3, have a binary reciprocal that has an infinite number of digits, such as (0, 1, 0, 1, 0, 1, ...), and cannot be expressed exactly as a binary tuple of finite extent. Others, such as 4, have reciprocals that have finite extent, such as (0, 1).

Our goal is to produce an (n + 2)-bit approximation to the reciprocal of n-bit binary numbers. (It simplifies the analysis to obtain an (n + 2)-bit approximation instead of an n-bit approximation.) We assume that each such binary number $\mathbf{u}$ has a 1 in its most significant position; that is, $2^{n-1} \le u < 2^n$. If this is not true, a simple circuit can be devised to determine the number of places by which to shift $\mathbf{u}$ left to meet this condition. (See Problem 2.25.) The result is shifted left by an equal amount to produce the reciprocal.

It follows that an (n + 2)-bit approximation to the reciprocal of an n-bit binary number $\mathbf{u}$ with $u_{n-1} = 1$ is represented by $\mathbf{r} = (r_{-1}, r_{-2}, r_{-3}, \ldots)$, where the first $n - 2$ digits of $\mathbf{r}$ are zero. Thus, the value of the approximate reciprocal is represented by the $n + 2$ components $(r_{-(n-1)}, r_{-(n)}, \ldots, r_{-(2n)})$. It follows that these components are produced by shifting $\mathbf{r}$ left by 2n places and removing the fractional bits. This defines the function $f^{(n)}_{\mathrm{recip}}$:

$$f^{(n)}_{\mathrm{recip}}(\mathbf{u}) = \left\lfloor \frac{2^{2n}}{u} \right\rfloor$$

The  approximation  described  below  can  be  used  to  compute  reciprocals. 

Figure 2.19 Newton's method for finding the zero of a function.

Newton's approximation algorithm is a method to find the zero $x_0$ of a twice continuously differentiable function $h : \mathbb{R} \mapsto \mathbb{R}$ on the reals (that is, $h(x_0) = 0$) when h has a non-zero derivative $h'(x)$ in the neighborhood of $x_0$. As suggested in Fig. 2.19, the slope of the tangent to the curve at the point $y_i$, $h'(y_i)$, is equal to $h(y_i)/(y_i - y_{i+1})$. For the convex increasing function shown in this figure, the value of $y_{i+1}$ is closer to the zero $x_0$ than is $y_i$. The same holds for all twice continuously differentiable functions whether increasing, decreasing, convex, or concave in the neighborhood of a zero. It follows that the recurrence

$$y_{i+1} = y_i - \frac{h(y_i)}{h'(y_i)} \qquad (2.11)$$


provides  values  increasingly  close  to  the  zero  of  h  as  long  as  it  is  started  with  a  value  sufficiently 
close  to  the  zero. 

The function $h(y) = 1 - 2^{2n}/(uy)$ has zero $y = 2^{2n}/u$. Since $h'(y) = 2^{2n}/(uy^2)$, the recurrence (2.11) becomes

$$y_{i+1} = 2y_i - \frac{u y_i^2}{2^{2n}}$$
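
The behaviour of this recurrence can be observed with exact rational arithmetic. The sketch below is a hedged illustration using Python's `fractions` module; the function name, the starting value, and the number of steps are choices made here, not part of the text. Writing $R = 2^{2n}/u$ and $e_i = R - y_i$, substituting into the recurrence gives $e_{i+1} = e_i^2/R$, so the error is squared at every step whenever the starting error is smaller than R.

```python
from fractions import Fraction


def newton_reciprocal_steps(u, n, y0, steps):
    """Iterate y <- 2y - u*y^2 / 2^(2n) exactly and return every iterate."""
    scale = Fraction(2) ** (2 * n)
    iterates = [Fraction(y0)]
    for _ in range(steps):
        y = iterates[-1]
        iterates.append(2 * y - u * y * y / scale)
    return iterates
```

For u = 11 and n = 4 the zero is 256/11. Starting from $y_0 = 16$, five steps reduce the error from 80/11 to far below one part in a million, and each error equals the square of the previous one divided by 256/11.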


When this recurrence is modified as follows, it converges to the (n + 2)-bit binary reciprocal of the n-bit binary number $\mathbf{u}$:

$$y_{i+1} = \left\lfloor \frac{2^{2n+1}y_i - u y_i^2}{2^{2n}} \right\rfloor$$

The size and depth of a circuit resulting from this recurrence are $O(M_{\mathrm{int}}(n, c) \log n)$ and $O(\log^2 n)$, respectively. However, this recurrence uses more gates than are necessary since it does calculations with full precision at each step even though the early steps use values of $y_i$ that are imprecise. We can reduce the size of the resulting circuit to $O(M_{\mathrm{int}}(n, c))$ if, instead of computing the reciprocal with n + 2 bits of accuracy at every step, we let the amount of accuracy vary with the number of stages, as in the algorithm recip(u, n) of Fig. 2.20. The algorithm recip is called $1 + \log_2 n$ times, the last time when n = 1.

We now show that the algorithm recip(u, n) computes the function $f^{(n)}_{\mathrm{recip}}(\mathbf{u}) = r = \lfloor 2^{2n}/u \rfloor$. In other words, we show that r satisfies $ru = 2^{2n} - s$ for some $0 \le s < u$. The proof is by induction on n.

The inductive hypothesis is that the algorithm recip(u, m) produces an (m + 2)-bit approximation to the reciprocal of the m-bit binary number $\mathbf{u}$ (whose most significant bit is 1), that is, it computes $r = \lfloor 2^{2m}/u \rfloor$. The assumption applies to the base case of m = 1 since u = 1 and r = 4. We assume it holds for m = n/2 and show that it also holds for m = n.


Algorithm recip(u, n)
    if n = 1 then
        r := 4;
    else begin
        t := recip($\lfloor u/2^{n/2} \rfloor$, n/2);
        r := $\lfloor (2^{3n/2+1} t - u t^2)/2^n \rfloor$;
        for j := 3 downto 0 do
            if ($u(r + 2^j) \le 2^{2n}$) then r := r + $2^j$;
    end;
    return(r);


Figure 2.20 An algorithm to compute r, the (n + 2)-bit approximation to the reciprocal of the n-bit binary number $\mathbf{u}$ representing the integer u, that is, $r = f^{(n)}_{\mathrm{recip}}(\mathbf{u})$.

Let $u_1$ and $u_0$ be the integers corresponding to the most and least significant n/2 bits respectively of u, that is, $u = u_1 2^{n/2} + u_0$. Since $2^{n-1} \le u < 2^n$, $2^{n/2-1} \le u_1 < 2^{n/2}$. Also, $\lfloor u/2^{n/2} \rfloor = u_1$. By the inductive hypothesis $t = \lfloor 2^n/u_1 \rfloor$ is the value returned by recip($u_1$, n/2); that is, $u_1 t = 2^n - s'$ for some $0 \le s' < u_1$. Let $w = 2^{3n/2+1}t - ut^2$. Then

$$uw = 2^{2n+1}u_1 t + 2^{3n/2+1}u_0 t - \left[t\left(u_1 2^{n/2} + u_0\right)\right]^2$$

Applying $u_1 t = 2^n - s'$, dividing both sides by $2^n$, and simplifying yields

$$\frac{uw}{2^n} = 2^{2n} - \left(s' - \frac{t u_0}{2^{n/2}}\right)^2 \qquad (2.12)$$

We now show that

$$\frac{uw}{2^n} \ge 2^{2n} - 8u \qquad (2.13)$$

by demonstrating that $(s' - t u_0/2^{n/2})^2 \le 8u$. We note that $s' < u_1 < 2^{n/2}$, which implies $(s')^2 < 2^{n/2}u_1 \le u$. Also, since $u_1 t = 2^n - s'$ or $t \le 2^n/u_1$ we have

$$\left(\frac{t u_0}{2^{n/2}}\right)^2 \le \left(\frac{2^{n/2} u_0}{u_1}\right)^2 < \left(2^{n/2+1}\right)^2 \le 8u$$

since $u_1 \ge 2^{n/2-1}$, $u_0 < 2^{n/2}$, and $2^{n-1} \le u$. The desired result follows from the observation that $(a - b)^2 \le \max(a^2, b^2)$.

Since $r = \lfloor w/2^n \rfloor$, it follows from (2.13) that

$$ur = u\left\lfloor \frac{w}{2^n} \right\rfloor > u\left(\frac{w}{2^n} - 1\right) = \frac{uw}{2^n} - u \ge 2^{2n} - 9u$$

It follows that $r > (2^{2n}/u) - 9$. Also from (2.12), we see that $r \le 2^{2n}/u$. The three-step adjustment process at the end of recip(u, n) increases ur by the largest integer multiple of u less than 16u that keeps it less than or equal to $2^{2n}$. That is, r satisfies $ur = 2^{2n} - s$ for some $0 \le s < u$, which means that r is the reciprocal of u.
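
The algorithm of Fig. 2.20 translates almost line for line into Python, which makes it easy to compare against direct integer division for small n. The sketch below assumes n is a power of 2 and that u has its most significant bit set, as the text requires; shifts replace the divisions by powers of 2.

```python
def recip(u, n):
    """Return floor(2^(2n) / u) for an n-bit integer u with leading bit 1.

    n must be a power of 2. Follows the recursive algorithm of Fig. 2.20.
    """
    if n == 1:
        return 4                                   # u = 1, so 2^2 / 1 = 4
    half = n // 2
    t = recip(u >> half, half)                     # reciprocal of the high half
    # One Newton step at doubled precision: floor((2^(3n/2+1) t - u t^2) / 2^n).
    r = ((t << (3 * half + 1)) - u * t * t) >> n
    # Adjustment: add 8, 4, 2, 1 whenever u * r stays at most 2^(2n).
    for j in (3, 2, 1, 0):
        if u * (r + (1 << j)) <= 1 << (2 * n):
            r += 1 << j
    return r
```

For example, with n = 4 and u = 11 the recursive call returns t = 8, the Newton step gives 20, and the adjustment adds 2 and then 1 to reach 23, which is $\lfloor 256/11 \rfloor$.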

The algorithm for recip(u, n) translates into a circuit as follows: a) recip(u, 1) is realized by an assignment, and b) recip(u, n), n > 1, is realized by invoking a circuit for recip($\lfloor u/2^{n/2} \rfloor$, n/2) followed by a circuit for $\lfloor (2^{3n/2+1}t - ut^2)/2^n \rfloor$ and one to implement the three-step adjustment. The first of these steps computes $\lfloor u/2^{n/2} \rfloor$, which does not require any gates, merely shifting and discarding bits. The second step requires shifting t left by 3n/2 places, computing $t^2$ and multiplying it by u, subtracting the result from the shifted version of t, and shifting the final result right by n places and discarding low-order bits. Circuits for this have size $cM_{\mathrm{int}}(n, c)$ for some constant c > 0 and depth $O(\log n)$. The third step can be done by computing ur, adding $u2^j$ for j = 3, 2, 1, or 0, and comparing the result with $2^{2n}$. The comparisons control whether $2^j$ is added to r or not. The one multiplication and the additions can be done with circuits of size $c'M_{\mathrm{int}}(n, c)$ for some constant c' > 0 and depth $O(\log n)$. The comparison operations can be done with a constant additional number of gates and constant depth. (See Problem 2.19.)

It follows that recip can be realized by a circuit whose size $C_{\mathrm{recip}}(n)$ is no more than a multiple of the size of an integer multiplication circuit, $M_{\mathrm{int}}(n, c)$, plus the size of a circuit for the invocation of recip($\lfloor u/2^{n/2} \rfloor$, n/2). That is,

$$C_{\mathrm{recip}}(n) \le C_{\mathrm{recip}}(n/2) + cM_{\mathrm{int}}(n, c)$$

$$C_{\mathrm{recip}}(1) = 1$$

for some constant c > 0. This inequality implies the following bound:

$$C_{\mathrm{recip}}(n) \le c\sum_{j=0}^{\log n} M_{\mathrm{int}}\left(\frac{n}{2^j}, c\right) \le cM_{\mathrm{int}}(n, c)\sum_{j=0}^{\log n}\frac{1}{2^j} = O(M_{\mathrm{int}}(n, c))$$

which follows since $M_{\mathrm{int}}(dn, c) \le dM_{\mathrm{int}}(n, c)$ when $d \le 1$.

The depth $D_{\mathrm{recip}}(n)$ of the circuit produced by this algorithm is at most $c \log n$ plus the depth $D_{\mathrm{recip}}(n/2)$. Since the circuit has at most $1 + \log_2 n$ stages with a depth of at most $c \log n$ each, $D_{\mathrm{recip}}(n) \le 2c \log^2 n$ when $n \ge 2$.

THEOREM 2.10.1 If $n = 2^k$, the reciprocal function $f^{(n)}_{\mathrm{recip}} : \mathcal{B}^n \mapsto \mathcal{B}^{n+2}$ for n-bit binary numbers can be realized by a circuit with the following size and depth:

$$C_\Omega\left(f^{(n)}_{\mathrm{recip}}\right) \le O(M_{\mathrm{int}}(n, c))$$

$$D_\Omega\left(f^{(n)}_{\mathrm{recip}}\right) \le c \log_2^2 n$$

VERY FAST RECIPROCAL Beame, Cook, and Hoover [33] have given an $O(\log n)$-depth circuit for the reciprocal function. It uses a sequence of about $n^2/\log n$ primes to represent an n-bit binary number x, $.5 \le x < 1$, using arithmetic modulo these primes. The size of the circuit produced is polynomial in n, although much larger than $M_{\mathrm{int}}(n, c)$. Reif and Tate [325] show that the reciprocal function can be computed with a circuit that is defined only in terms of n and has a size proportional to $M_{\mathrm{int}}$ (and thus nearly optimal) and depth $O(\log n \log\log n)$. Although the depth bound is not quite as good as that of Beame, Cook, and Hoover, its size bound is very good.

2.10.1  Reductions  to  the  Reciprocal 

In this section we show that the reciprocal function contains the squaring function as a subfunction. It follows from Problem 2.33 and the preceding result that integer multiplication and division have comparable circuit size. We use Taylor's theorem [315, p. 345] to establish the desired result.

THEOREM 2.10.2 (Taylor) Let $f(x) : \mathbb{R} \mapsto \mathbb{R}$ be a continuous real-valued function defined on the interval [a, b] whose kth derivative is also continuous for $k \le n + 1$ over the same interval. Then for $a \le x_0 \le x \le b$, $f(x)$ can be expanded as

$$f(x) = f(x_0) + (x - x_0)f^{[1]}(x_0) + \frac{(x - x_0)^2}{2}f^{[2]}(x_0) + \cdots + \frac{(x - x_0)^n}{n!}f^{[n]}(x_0) + r_n$$

where $f^{[n]}$ denotes the nth derivative of f and the remainder $r_n$ satisfies

$$r_n = \int_{x_0}^{x} f^{[n+1]}(t)\frac{(x - t)^n}{n!}\,dt = \frac{(x - x_0)^{n+1}}{(n+1)!}f^{[n+1]}(\psi)$$

for some $\psi$ satisfying $x_0 \le \psi \le x$.

Taylor's theorem is used to expand $\lfloor 2^{2n-1}/|\mathbf{u}| \rfloor$ by applying it to the function $f(w) = (1 + w)^{-1}$ on the interval [0, 1]. The Taylor expansion of this function is

$$(1 + w)^{-1} = 1 - w + w^2 - w^3(1 + \psi)^{-4}$$

for some $0 \le \psi \le 1$. The magnitude of the last term is at most $w^3$.

Let $n \ge 12$, $k = \lfloor n/2 \rfloor$, $l = \lfloor n/12 \rfloor$ and restrict $|\mathbf{u}|$ as follows:

$$|\mathbf{u}| = 2^k + |\mathbf{a}| \quad \text{where} \quad |\mathbf{a}| = 2^l|\mathbf{b}| + 1 \quad \text{and} \quad |\mathbf{b}| \le 2^{l-1} - 1$$

It follows that $|\mathbf{a}| \le 2^{2l-1} - 2^l + 1 < 2^{2l-1}$ for $l \ge 1$. Applying the Taylor series expansion to $(1 + |\mathbf{a}|/2^k)^{-1}$, we have

$$\frac{2^{2n-1}}{2^k + |\mathbf{a}|} = 2^{2n-1-k}\left[1 - \frac{|\mathbf{a}|}{2^k} + \left(\frac{|\mathbf{a}|}{2^k}\right)^2 - \left(\frac{|\mathbf{a}|}{2^k}\right)^3(1 + \psi)^{-4}\right] \qquad (2.14)$$

for some $0 \le \psi \le 1$. For the given range of values for $|\mathbf{u}|$ both the sum of the first two terms and the third term on the right-hand side have the following bounds:

$$2^{2n-1-k}\left(1 - |\mathbf{a}|/2^k\right) > 2^{2n-1-k}\left(1 - 2^{2l-1}/2^k\right)$$

$$2^{2n-1-k}\left(|\mathbf{a}|/2^k\right)^2 < 2^{2n-1-k}\left(2^{2l-1}/2^k\right)^2$$

Since $2^{2l-1}/2^k < 1/2$, the value of the third term, $2^{2n-1-k}(|\mathbf{a}|/2^k)^2$, is an integer that does not overlap in any bit positions with the sum of the first two terms.

The fourth term is negative; its magnitude has the following upper bound:

$$2^{2n-1-4k}|\mathbf{a}|^3(1 + \psi)^{-4} < 2^{3(2l-1)+2n-1-4k}$$

Expanding the third term, we have

$$2^{2n-1-3k}(|\mathbf{a}|)^2 = 2^{2n-1-3k}\left(2^{2l}|\mathbf{b}|^2 + 2^{l+1}|\mathbf{b}| + 1\right)$$

Because $3(2l - 1) \le k$, the third term on the right-hand side of this expansion has value $2^{2n-1-3k}$ and is larger than the magnitude of the fourth term in (2.14). Consequently the fourth term does not affect the value of the result in (2.14) in positions occupied by the binary representation of $2^{2n-1-3k}(2^{2l}|\mathbf{b}|^2 + 2^{l+1}|\mathbf{b}|)$. In turn, $2^{l+1}|\mathbf{b}|$ is less than $2^{2l}$, which means that the binary representation of $2^{2n-1-3k}(2^{2l}|\mathbf{b}|^2)$ appears in the output shifted but otherwise without modification. This provides the following result.

LEMMA 2.10.1 The reciprocal function $f^{(n)}_{\mathrm{recip}}$ contains as a subfunction the squaring function $f^{(m)}_{\mathrm{square}}$ for $m = \lfloor n/12 \rfloor - 1$.

Proof The value of the l-bit binary number denoted by $\mathbf{b}$ appears in the output if $l = \lfloor n/12 \rfloor \ge 1$. ■
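
The argument can be checked numerically for moderate n. The Python sketch below is an illustration under the restrictions stated above, $|\mathbf{u}| = 2^k + 2^l|\mathbf{b}| + 1$ with $|\mathbf{b}| \le 2^{l-1} - 1$: it computes $\lfloor 2^{2n-1}/|\mathbf{u}| \rfloor$ by integer division and reads $|\mathbf{b}|^2$ from the bits starting at position $2n - 1 - 3k + 2l$. The function name and the 2l-bit read-out window are choices made here; the window is safe because $|\mathbf{b}|^2 < 2^{2l-2}$ and the first two terms of (2.14) only affect positions at least $k - 2l \ge 2l$ places higher.

```python
def square_from_reciprocal(b, n):
    """Read b*b out of floor(2^(2n-1) / u) for u = 2^k + 2^l * b + 1."""
    k, l = n // 2, n // 12
    if l < 2 or not 0 <= b <= (1 << (l - 1)) - 1:
        raise ValueError("need n >= 24 and 0 <= b <= 2^(l-1) - 1")
    a = (b << l) + 1
    u = (1 << k) + a
    q = (1 << (2 * n - 1)) // u              # the reciprocal, by exact division
    position = 2 * n - 1 - 3 * k + 2 * l     # where 2^(2l) b^2 starts in (2.14)
    return (q >> position) & ((1 << (2 * l)) - 1)
```

With n = 24 (so k = 12, l = 2) and b = 1, u is 4101 and bit 15 of the quotient carries the value $1^2 = 1$.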

Lower bounds similar to those derived for the reciprocal function can be derived for special fractional powers of binary numbers. (See Problem 2.35.)

2.11 Symmetric Functions

The symmetric functions are encountered in many applications. Among the important symmetric functions is binary sorting, the binary version of the standard sorting function. A surprising fact holds for binary sorting, namely, that it can be realized on n inputs by a circuit whose size is linear in n (see Problem 2.17), whereas non-binary sorting requires on the order of $n \log n$ operations. Binary sorting, and all other symmetric functions, can be realized efficiently through the use of a counting circuit that counts the number of 1's among the n inputs with a circuit of size linear in n. The counting circuit uses AND, OR, and NOT. When negations are disallowed, binary sorting requires on the order of $n \log n$ gates, as shown in Section 9.6.1.

DEFINITION 2.11.1 A permutation $\pi$ of an n-tuple $\mathbf{x} = (x_1, x_2, \ldots, x_n)$ is a reordering $\pi(\mathbf{x}) = (x_{\pi(1)}, x_{\pi(2)}, \ldots, x_{\pi(n)})$ of the components of $\mathbf{x}$. That is, $\{\pi(1), \pi(2), \ldots, \pi(n)\} = \{1, 2, 3, \ldots, n\}$. A symmetric function $f^{(n)} : \mathcal{B}^n \mapsto \mathcal{B}^m$ is a function for which $f^{(n)}(\mathbf{x}) = f^{(n)}(\pi(\mathbf{x}))$ for all permutations $\pi$. $S_{n,m}$ is the set of all symmetric functions $f^{(n)} : \mathcal{B}^n \mapsto \mathcal{B}^m$ and $S_n = S_{n,1}$ is the set of Boolean symmetric functions on n inputs.

If $f^{(3)}$ is symmetric, then $f^{(3)}(0, 1, 1) = f^{(3)}(1, 0, 1) = f^{(3)}(1, 1, 0)$.

The  following  are  symmetric  functions: 

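1. Threshold functions $\tau^{(n)}_t : \mathcal{B}^n \mapsto \mathcal{B}$, $1 \le t \le n$:

$$\tau^{(n)}_t(\mathbf{x}) = \begin{cases} 1 & \sum_{j=1}^{n} x_j \ge t \\ 0 & \text{otherwise} \end{cases}$$

2. Elementary symmetric functions $e^{(n)}_t : \mathcal{B}^n \mapsto \mathcal{B}$, $0 \le t \le n$:

$$e^{(n)}_t(\mathbf{x}) = \begin{cases} 1 & \sum_{j=1}^{n} x_j = t \\ 0 & \text{otherwise} \end{cases}$$

3. Binary sorting function $f^{(n)}_{\mathrm{sort}} : \mathcal{B}^n \mapsto \mathcal{B}^n$ sorts an n-tuple into descending order:

$$f^{(n)}_{\mathrm{sort}}(\mathbf{x}) = \left(\tau^{(n)}_1, \tau^{(n)}_2, \ldots, \tau^{(n)}_n\right)$$

Here $\tau^{(n)}_t$ is the tth threshold function.

4. Modulus functions $f^{(n)}_{c,\,\mathrm{mod}\,m} : \mathcal{B}^n \mapsto \mathcal{B}$, $0 \le c \le m - 1$:

$$f^{(n)}_{c,\,\mathrm{mod}\,m}(\mathbf{x}) = \begin{cases} 1 & \sum_{j=1}^{n} x_j = c \bmod m \\ 0 & \text{otherwise} \end{cases}$$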
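
The four families above are easy to state as ordinary functions on tuples of 0s and 1s. The following Python sketch is illustrative only (the names are chosen here); it also includes a brute-force symmetry check that tries every input and every permutation, so it is practical only for small n.

```python
from itertools import permutations, product


def threshold(t, x):
    """tau_t: 1 when at least t inputs are 1."""
    return int(sum(x) >= t)


def elementary(t, x):
    """e_t: 1 when exactly t inputs are 1."""
    return int(sum(x) == t)


def binary_sort(x):
    """Descending sort of a binary tuple, built from the n threshold functions."""
    return tuple(threshold(t, x) for t in range(1, len(x) + 1))


def modulus(c, m, x):
    """1 when the number of 1s is congruent to c modulo m."""
    return int(sum(x) % m == c)


def is_symmetric(f, n):
    """Check f(x) == f(pi(x)) for every n-tuple x and every permutation pi."""
    return all(
        f(x) == f(tuple(x[i] for i in perm))
        for x in product((0, 1), repeat=n)
        for perm in permutations(range(n))
    )
```

For example, `binary_sort((0, 1, 0, 1, 1))` yields `(1, 1, 1, 0, 0)`, while a function such as `lambda x: x[0]`, which looks at one fixed position, fails the symmetry check.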


The elementary symmetric functions $e_t$ are building blocks in terms of which other symmetric functions can be realized at small additional cost. Each symmetric function $f^{(n)}$ is determined uniquely by its value $v_t$, $0 \le t \le n$, when exactly t of the input variables are 1. It follows that we can write $f^{(n)}(\mathbf{x})$ as

$$f^{(n)}(\mathbf{x}) = \bigvee_{0 \le t \le n} v_t \wedge e^{(n)}_t(\mathbf{x}) = \bigvee_{t \,\mid\, v_t = 1} e^{(n)}_t(\mathbf{x}) \qquad (2.15)$$

Thus, efficient circuits for the elementary symmetric functions yield efficient circuits for general symmetric functions.
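
Expansion (2.15) can be tried out on a small case. In the Python sketch below, the helper names `value_vector` and `from_value_vector` are hypothetical: the first tabulates $v_t$ by evaluating a symmetric Boolean function on one input with exactly t ones, and the second rebuilds the function as an OR of the selected elementary symmetric functions.

```python
def elementary(t, x):
    """e_t: 1 when exactly t components of x are 1."""
    return int(sum(x) == t)


def value_vector(f, n):
    """v_t for t = 0..n: the value of symmetric f when exactly t inputs are 1."""
    return [f((1,) * t + (0,) * (n - t)) for t in range(n + 1)]


def from_value_vector(v, x):
    """Equation (2.15): OR of e_t(x) over those t with v_t = 1."""
    return int(any(elementary(t, x) for t, v_t in enumerate(v) if v_t))
```

For the parity function on four inputs the value vector is `[0, 1, 0, 1, 0]`, so parity is the OR of $e_1$ and $e_3$.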

An efficient circuit for the elementary symmetric functions can be obtained from a circuit for counting the number of 1's among the variables $\mathbf{x}$. This counting function $f^{(n)}_{\mathrm{count}} : \mathcal{B}^n \mapsto \mathcal{B}^{\lceil \log_2(n+1) \rceil}$ produces a $\lceil \log_2(n+1) \rceil$-bit binary number representing the number of 1's among the n inputs $x_1, x_2, \ldots, x_n$.

A recursive construction for the counting function is shown in Fig. 2.21(b) when $m = 2^{l+1} - 1$. The m inputs are organized into three groups, the first $2^l - 1$ Boolean variables $\mathbf{u}$, the second $2^l - 1$ variables $\mathbf{v}$, and the last variable $x_m$. The sum is represented by l "sum bits" $s^{(l+1)}_j$, $0 \le j \le l - 1$, and the "carry bit" $c^{(l+1)}_l$. This sum is formed by adding in a ripple adder the outputs $s^{(l)}_j$, $0 \le j \le l - 2$, and $c^{(l)}_{l-1}$ from the two counting circuits, each on $2^l - 1$ inputs, and the mth input $x_m$. (We abuse notation and use the same variables for the outputs of the different counting circuits.) The counting circuit on $2^2 - 1 = 3$ inputs is the full adder of Fig. 2.21(a). From this construction we have the following lemma:

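LEMMA 2.11.1 For $n = 2^k - 1$, $k \ge 2$, the counting function $f^{(n)}_{\mathrm{count}} : \mathcal{B}^n \mapsto \mathcal{B}^{\lceil \log_2(n+1) \rceil}$ can be realized with the following circuit size and depth over the basis $\Omega = \{\wedge, \vee, \oplus\}$:

$$C_\Omega\left(f^{(n)}_{\mathrm{count}}\right) \le 5(2^k - k - 1)$$

$$D_\Omega\left(f^{(n)}_{\mathrm{count}}\right) \le 4k - 5$$

Proof Let $C(k) = C_\Omega(f^{(n)}_{\mathrm{count}})$ and $D(k) = D_\Omega(f^{(n)}_{\mathrm{count}})$ when $n = 2^k - 1$. Clearly, $C(2) = 5$ and $D(2) = 3$ since a full adder over $\Omega = \{\wedge, \vee, \oplus\}$ has five gates and depth 3. The following inequality is immediate from the construction:

$$C(k) \le 2C(k - 1) + 5(k - 1)$$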


Figure 2.21 A recursive construction for the counting function $f^{(m)}_{\mathrm{count}}$, $m = 2^{l+1} - 1$: (a) a full adder; (b) the recursive construction.




The size bound follows immediately. The depth bound requires a more careful analysis.

Shown in Fig. 2.21(a) is a full adder together with notation showing the amount by which the length of a path from one input to another is increased in passing through it when the full-adder circuit used is that shown in Fig. 2.14 and described by Equation 2.6. From this it follows that

$$D_\Omega\left(c^{(l+1)}_{j+1}\right) = \max\left(D_\Omega\left(c^{(l+1)}_j\right) + 2,\ D_\Omega\left(s^{(l)}_j\right) + 3\right)$$

$$D_\Omega\left(s^{(l+1)}_j\right) = \max\left(D_\Omega\left(c^{(l+1)}_j\right) + 1,\ D_\Omega\left(s^{(l)}_j\right) + 2\right)$$

for $2 \le l$ and $0 \le j \le l - 1$, where $s^{(l)}_{l-1} = c^{(l)}_{l-1}$. It can be shown by induction that $D_\Omega(c^{(k)}_j) = 2(k + j) - 3$, $1 \le j \le k - 1$, and $D_\Omega(s^{(k)}_j) = 2(k + j) - 2$, $0 \le j \le k - 2$, both for $2 \le k$. (See Problem 2.16.) Thus, $D_\Omega(f^{(n)}_{\mathrm{count}}) = D_\Omega(c^{(k)}_{k-1}) = (4k - 5)$. ■
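
The recursive construction and its gate count can be simulated directly. The Python sketch below is an illustration, not the circuit of Fig. 2.21: it assumes a five-gate full adder over {AND, OR, XOR} with a shared XOR, and it starts the recursion at a single input with zero gates, which for three inputs reduces to exactly one full adder. The function names are chosen here.

```python
def full_adder(u, v, c):
    """Five-gate full adder over {AND, OR, XOR}; returns (sum bit, carry bit)."""
    w = u ^ v                      # gate 1, reused by sum and carry
    s = w ^ c                      # gate 2
    carry = (u & v) | (c & w)      # gates 3, 4, 5
    return s, carry


def count_ones(bits):
    """Count the 1s among 2^k - 1 inputs.

    Returns (count as a list of bits, least significant first; gates used).
    """
    m = len(bits)
    if m < 1 or (m + 1) & m != 0:
        raise ValueError("the number of inputs must be 2^k - 1")
    if m == 1:
        return [bits[0]], 0                       # a wire, no gates
    half = (m - 1) // 2
    left, gates_left = count_ones(bits[:half])    # first 2^(k-1) - 1 inputs
    right, gates_right = count_ones(bits[half:2 * half])
    carry = bits[-1]                              # last input is the carry-in
    out = []
    for a, b in zip(left, right):                 # ripple adder, k - 1 full adders
        s, carry = full_adder(a, b, carry)
        out.append(s)
    out.append(carry)                             # final carry is the top bit
    return out, gates_left + gates_right + 5 * len(left)
```

On seven inputs the simulation uses 20 gates, matching $5(2^3 - 3 - 1)$ from Lemma 2.11.1.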

We now use this bound to derive upper bounds on the size and depth of symmetric functions in the class $S_{n,m}$.

THEOREM 2.11.1 Every symmetric function $f^{(n)} : \mathcal{B}^n \mapsto \mathcal{B}^m$ can be realized with the following circuit size and depth over the basis $\Omega = \{\wedge, \vee, \oplus\}$ where $\phi(k) = 5(2^k - k - 1)$:

$$C_\Omega\left(f^{(n)}\right) \le m\lceil (n+1)/2 \rceil + \phi(k) + 2(n+1) + \left(2\lceil \log_2(n+1) \rceil - 2\right)\sqrt{2(n+1)}$$

$$D_\Omega\left(f^{(n)}\right) \le 5\lceil \log_2(n+1) \rceil + \lceil \log_2\lceil \log_2(n+1) \rceil \rceil - 4$$

for $k = \lceil \log_2(n+1) \rceil$ even.

Proof Lemma 2.11.1 establishes bounds on the size and depth of the function $f^{(n)}_{\mathrm{count}}$ for $n = 2^k - 1$. For other values of n, let $k = \lceil \log_2(n+1) \rceil$ and fill out the $2^k - 1 - n$ variables with 0's.

The elementary symmetric functions are obtained by applying the value of $f^{(n)}_{\mathrm{count}}$ as argument to the decoder function. A circuit for this function has been constructed that has size $2(n+1) + (2\lceil \log_2(n+1) \rceil - 2)\sqrt{2(n+1)}$ and depth $\lceil \log_2\lceil \log_2(n+1) \rceil \rceil + 1$. (See Lemma 2.5.4. We use the fact that $2^{\lceil \log_2 m \rceil} \le 2m$.) Thus, all elementary symmetric functions on n variables can be realized with the following circuit size and depth:

$$C_\Omega\left(e^{(n)}_0, e^{(n)}_1, \ldots, e^{(n)}_n\right) \le \phi(k) + 2(n+1) + \left(2\lceil \log_2(n+1) \rceil - 2\right)\sqrt{2(n+1)}$$

$$D_\Omega\left(e^{(n)}_0, e^{(n)}_1, \ldots, e^{(n)}_n\right) \le 4k - 5 + \lceil \log_2\lceil \log_2(n+1) \rceil \rceil + 1$$

The expansion of Equation (2.15) can be used to realize an arbitrary Boolean symmetric function. Clearly, at most n OR gates and depth $\lceil \log_2 n \rceil$ suffice to realize each one of m arbitrary Boolean symmetric functions. (Since the $v_t$ are fixed, no ANDs are needed.) This number of ORs can be reduced to $(n - 1)/2$ as follows: if $\lceil (n+1)/2 \rceil$ or more elementary functions are needed, use the complementary set (of at most $\lfloor (n+1)/2 \rfloor$ functions) and take the complement of the result. Thus, no more than $\lceil (n+1)/2 \rceil - 1$ ORs are needed per symmetric function (plus possibly one NOT), and depth at most $\lceil \log_2\lfloor (n+1)/2 \rfloor \rceil + 1 \le \lceil \log_2(n+1) \rceil$. ■

This theorem establishes that the binary sorting function $f^{(n)}_{\mathrm{sort}} : \mathcal{B}^n \mapsto \mathcal{B}^n$ has size $O(n^2)$. In fact, a linear-size circuit can be constructed for it, as stated in Problem 2.17.

2.12  Most  Boolean  Functions  Are  Complex 

As we show in this section, the circuit size and depth of most Boolean functions $f : \mathcal{B}^n \mapsto \mathcal{B}$
on  n  variables  are  at  least  exponential  and  linear  in  n,  respectively.  Furthermore,  we  show  in 
Section  2.13  that  such  functions  can  be  realized  with  circuits  whose  size  and  depth  are  at  most 
exponential  and  linear,  respectively,  in  n.  Thus,  the  circuit  size  and  depth  of  most  Boolean 
functions  on  n  variables  are  tightly  bounded.  Unfortunately,  this  result  says  nothing  about  the 
size  and  depth  of  a  specific  function,  the  case  of  most  interest. 

Each Boolean function on n variables is represented by a table with $2^n$ rows and one column of values for the function. Since each entry in this one column can be completed in one of two ways, there are $2^{2^n}$ ways to fill in the column. Thus, there are exactly $2^{2^n}$ Boolean functions on n variables. Most of these functions cannot be realized by small circuits because there just are not enough small circuits.

THEOREM 2.12.1 Let $0 < \epsilon < 1$. The fraction of the Boolean functions $f : \mathcal{B}^n \mapsto \mathcal{B}$ that have size complexity $C_{\Omega_0}(f)$ satisfying the following lower bound is at least $1 - 2^{-(\epsilon/2)2^n}$ when $n \ge 2[(1 - \epsilon)/\epsilon]\log_2[(3e)^2(1 - \epsilon/2)]$. (Here e = 2.71828... is the base of the natural logarithm.)

$$C_{\Omega_0}(f) \ge \frac{2^n}{n}(1 - \epsilon) - 2n^2$$

Proof  Each  circuit  contains  some  number,  say  g,  of  gates  and  each  gate  can  be  one  of  the 
three  types  of  gate  in  the  standard  basis.  The  circuit  with  no  gates  computes  the  constant 
functions  with  value  of  1  or  0  on  all  inputs. 

An input to a gate can either be the output of another gate or one of the $n$ input variables.
(Since the basis $\Omega_0$ is {AND, OR, NOT}, no gate need have a constant input.) Since each
gate has at most two inputs, there are at most $(g-1+n)^2$ ways to connect inputs to one
gate and $(g-1+n)^{2g}$ ways to interconnect $g$ gates. In addition, since each gate can be
one of three types, there are $3^g$ ways to name the gates. Since there are $g!$ orderings of
$g$ items (gates) and the ordering of gates does not change the function they compute, at
most $N(g) = 3^g(g+n)^{2g}/g!$ distinct functions can be realized with $g$ gates. Also, since
$g! \geq g^g e^{-g}$ (see Problem 2.2) it follows that

$$N(g) \leq (3e)^g[(g^2 + 2gn + n^2)/g]^g \leq (3e)^g(g + 2n^2)^g$$

The last inequality follows because $2gn + n^2 \leq 2gn^2$ for $n \geq 2$. Since the last bound is an
increasing function of $g$, $N(0) = 2$ and $G + 1 \leq (3e)^G$ for $G \geq 1$, the number $M(G)$ of
functions realizable with between 0 and $G$ gates satisfies

$$M(G) \leq (G+1)(3e)^G(G + 2n^2)^G \leq [(3e)^2(G + 2n^2)]^G \leq (x^x)^{1/a}$$

where $x = a(G + 2n^2)$ and $a = (3e)^2$. With base-2 logarithms, it is straightforward to
show that $x^x \leq 2^{x_0}$ if $x \leq x_0/\log_2 x_0$ and $x_0 \geq 2$.

If $M(G) \leq 2^{(1-\delta)2^n}$ for $0 < \delta < 1$, at most a fraction $2^{(1-\delta)2^n}/2^{2^n} = 2^{-\delta 2^n}$ of the
Boolean functions on $n$ variables have circuits with $G$ or fewer gates.

Let $G < 2^n(1-\epsilon)/n - 2n^2$. Then $x = a(G + 2n^2) \leq a2^n(1-\epsilon)/n \leq x_0/\log_2 x_0$
for $x_0 = a2^n(1-\epsilon/2)$ when $n \geq 2[(1-\epsilon)/\epsilon]\log_2[(3e)^2(1-\epsilon/2)]$, as can be shown
directly. It follows that $M(G) \leq (x^x)^{1/a} \leq 2^{x_0/a} = 2^{2^n(1-\epsilon/2)}$. ■
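The counting argument can be sanity-checked numerically. The sketch below evaluates the base-2 logarithm of the bound $[(3e)^2(G + 2n^2)]^G$ for the largest integer $G$ strictly below $2^n(1-\epsilon)/n - 2n^2$ and compares it with $2^n(1-\epsilon/2)$. The function names and the sample values of $n$ and $\epsilon$ are assumptions made for illustration only.

```python
import math

A = (3 * math.e) ** 2  # the constant a = (3e)^2 used in the proof

def threshold_n(eps):
    """Smallest integer n meeting the theorem's condition on n."""
    return math.ceil(2 * ((1 - eps) / eps) * math.log2(A * (1 - eps / 2)))

def log2_count_bound(n, eps):
    """log2 of [(3e)^2 (G + 2n^2)]^G for the largest integer G
    strictly below 2^n (1 - eps)/n - 2n^2."""
    limit = 2 ** n * (1 - eps) / n - 2 * n * n
    G = math.ceil(limit) - 1
    if G < 1:
        raise ValueError("the size bound is not positive for this n")
    return G * math.log2(A * (G + 2 * n * n))

eps = 0.5
for n in (16, 20, 24):
    lhs = log2_count_bound(n, eps)
    rhs = 2 ** n * (1 - eps / 2)  # log2 of 2^(2^n (1 - eps/2))
    print(n, n >= threshold_n(eps), lhs <= rhs)
```

For small $n$ the quantity $2^n(1-\epsilon)/n - 2n^2$ is negative and the theorem says nothing useful, which is why the sketch raises an error in that case.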

To show that most Boolean functions $f : \mathcal{B}^n \mapsto \mathcal{B}$ over the basis $\Omega_0$ require circuits with
a depth linear in $n$, we use a similar argument. We first show that for every circuit there is a
tree circuit (a circuit in which either zero or one edge is directed away from each gate) that
computes the same function and has the same depth. Thus when searching for small-depth
circuits it suffices to look only at tree circuits. We then obtain an upper bound on the number
of tree circuits of depth $d$ or less and show that unless $d$ is linear in $n$, most Boolean functions
on $n$ variables cannot be realized with this depth.

LEMMA 2.12.1 Given a circuit for a function $f : \mathcal{B}^n \mapsto \mathcal{B}^m$, a tree circuit can be constructed of
the same depth that computes $f$.

Proof  Convert  a  circuit  to  a  tree  circuit  without  changing  its  depth  as  follows:  find  a  vertex 
v  with  out-degree  2  or  more  at  maximal  distance  from  an  output  vertex.  Attach  a  copy  of  the 
tree  subcircuit  with  output  vertex  v  to  each  of  the  edges  directed  away  from  v.  This  reduces 
by 1 the number of vertices with out-degree greater than 1 but does not change the depth or
function computed. Repeat this process on the new circuit until no vertices of out-degree
greater than 1 remain. ■
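The effect of this conversion can be illustrated with a short sketch. Note that it uses recursive unfolding from the output vertex rather than the step-by-step copying described in the proof; the resulting tree is the same. The circuit encoding, the example circuit, and all function names are hypothetical choices made for illustration.

```python
from itertools import product

# Hypothetical example circuit in which gate g1 has out-degree 2.
CIRCUIT = {
    "x": ("IN", ()), "y": ("IN", ()), "z": ("IN", ()),
    "g1": ("AND", ("x", "y")),
    "g2": ("NOT", ("g1",)),
    "g3": ("OR", ("g1", "z")),
    "g4": ("AND", ("g2", "g3")),  # output vertex
}

def dag_depth(circuit, v):
    """Number of gates on the longest path from an input to vertex v."""
    op, args = circuit[v]
    return 0 if op == "IN" else 1 + max(dag_depth(circuit, u) for u in args)

def unfold(circuit, v):
    """Copy the subcircuit below v for every use, yielding a nested-tuple tree."""
    op, args = circuit[v]
    if op == "IN":
        return ("IN", v)
    return (op,) + tuple(unfold(circuit, u) for u in args)

def tree_depth(t):
    return 0 if t[0] == "IN" else 1 + max(tree_depth(s) for s in t[1:])

def tree_gates(t):
    return 0 if t[0] == "IN" else 1 + sum(tree_gates(s) for s in t[1:])

def evaluate(t, x):
    """Evaluate a tree circuit on an assignment x (dict from input name to bit)."""
    if t[0] == "IN":
        return x[t[1]]
    vals = [evaluate(s, x) for s in t[1:]]
    if t[0] == "AND":
        return vals[0] & vals[1]
    if t[0] == "OR":
        return vals[0] | vals[1]
    return 1 - vals[0]  # NOT

tree = unfold(CIRCUIT, "g4")
# The depth is unchanged; only the gate count grows (g1 is duplicated).
print(dag_depth(CIRCUIT, "g4"), tree_depth(tree), tree_gates(tree))
```

The tree has more gates than the original circuit, which is acceptable here because only depth is being bounded.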

We count the number of tree circuits of depth $d$ as follows. First, we determine $T(d)$, the
number of binary, unlabeled, and unoriented trees of depth $d$. (The root has two descendants
as does every other vertex except for leaves, which have none. No vertex carries a label and we
count as one tree those trees that differ only by the exchange of the two subtrees at a vertex.)
We then multiply $T(d)$ by the number of ways to label the internal vertices with one of at
most three gates and each leaf with one of the $n$ variables or two constants to obtain an upper
bound on $N(d)$, the number of distinct tree circuits of depth $d$. Since a tree of depth $d$ has at
most $2^d - 1$ internal vertices and $2^d$ leaves (see Problem 2.3), $N(d) \leq T(d)\,3^{2^d}(n+2)^{2^d}$.

LEMMA 2.12.2 When $d \geq 4$ the number $T(d)$ of depth-$d$ unlabeled, unoriented binary trees
satisfies $T(d) \leq (56)^{2^{d-4}}$.

Proof There is one binary tree of depth 0, a tree containing a single vertex, and one of
depth 1. Let $C(d)$ be the number of unlabeled, unoriented binary trees of depth $d$ or less,
including depth 0. Thus, $C(0) = 1$, $T(1) = 1$, and $C(1) = 2$. This recurrence for $C(d)$
follows immediately for $d \geq 1$:

$$C(d) = C(d-1) + T(d) \tag{2.16}$$

We now enumerate the unoriented, unlabeled binary trees of depth $d+1$. Without loss of
generality, let the left subtree of the root have depth $d$. There are $T(d)$ such subtrees. The
right subtree can either be of depth $d-1$ or less (there are $C(d-1)$ such trees) or of depth
$d$. In the first case there are $T(d)C(d-1)$ trees. In the second, there are $T(d)(T(d)-1)/2$
pairs of different subtrees (orientation is not counted) and $T(d)$ pairs of identical subtrees.
It follows that

$$T(d+1) = T(d)C(d-1) + T(d)(T(d)-1)/2 + T(d) \tag{2.17}$$
Thus, $T(2) = 2$, $C(2) = 4$, $T(3) = 7$, $C(3) = 11$, and $T(4) = 56$. From this recurrence
we conclude that $T(d+1) \geq T^2(d)/2$. We use this fact and the inequality $y \geq 1/(1 - 1/y)$,
which holds for $y \geq 2$, to show that $(T(d+1)/T(d)) + T(d)/2 \leq T(d+1)/2$. Since
$T(d) \geq 4$ for $d \geq 3$, it follows that $T(d)/2 \geq 1/(1 - 2/T(d))$. Replacing $T(d)/2$ by this
lower bound in the inequality $T(d+1) \geq T^2(d)/2$, we achieve the desired result by simple
algebraic manipulation. We use this fact below.

Solving the equation (2.17) for $C(d-1)$, we have

$$C(d-1) = \frac{T(d+1)}{T(d)} - \frac{T(d)+1}{2}$$

Substituting this expression into (2.16) yields the following recurrence:

$$\frac{T(d+2)}{T(d+1)} = \frac{T(d+1)}{T(d)} + \frac{T(d+1) + T(d)}{2} \tag{2.18}$$


Since $(T(d+1)/T(d)) + T(d)/2 \leq T(d+1)/2$, it follows that $T(d+2)$ satisfies the
inequality $T(d+2) \leq T^2(d+1)$ when $d \geq 3$ or $T(d) \leq T^2(d-1)$ when $d \geq 5$ and
$d - 1 \geq 4$. Thus, $T(d) \leq T^{2^j}(d-j)$ for $d - j \geq 4$ or $T(d) \leq (56)^{2^{d-4}}$ for $d \geq 4$. ■
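The recurrences (2.16) and (2.17) are easy to iterate, which gives a direct check of the small values quoted in the proof and of the bound in the lemma. This is a minimal sketch; the function name `tree_counts` is a hypothetical helper.

```python
def tree_counts(max_depth):
    """Return lists T and C where T[d] counts unlabeled, unoriented binary
    trees of depth exactly d and C[d] counts those of depth d or less."""
    T = [1, 1]  # T(0) = T(1) = 1
    C = [1, 2]  # C(0) = 1, C(1) = 2
    for d in range(1, max_depth):
        # Recurrence (2.17): right subtree shallower, different, or identical.
        t_next = T[d] * C[d - 1] + T[d] * (T[d] - 1) // 2 + T[d]
        T.append(t_next)
        # Recurrence (2.16).
        C.append(C[d] + t_next)
    return T, C

T, C = tree_counts(7)
print(T[:5], C[:4])
for d in range(4, 8):
    # Bound of the lemma: T(d) <= 56^(2^(d-4)) for d >= 4.
    assert T[d] <= 56 ** (2 ** (d - 4))
for d in range(1, 7):
    # Intermediate fact used in the proof: T(d+1) >= T(d)^2 / 2.
    assert 2 * T[d + 1] >= T[d] ** 2
```

The bound holds with equality at $d = 4$ and becomes progressively looser for larger $d$.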


Combine this with the earlier upper bound on $N(d)$ for the number of tree circuits over $\Omega_0$
of depth $d$ and we have that $N(d) \leq c^{2^d}$ for $d \geq 4$, where $c = 3((56)^{1/16})(n+2)$. (Note that
$3(56)^{1/16} \leq 4$.) The number of such trees of depth 0 through $d$ is at most $N(d+1) \leq c^{2^{d+1}}$.
But if $c^{2^{D_0+1}}$ is at most $2^{2^n(1-\delta)}$ then a fraction of at most $2^{-\delta 2^n}$ of the Boolean functions on
$n$ variables have depth $D_0$ or less. But this holds when

$$D_0 = n - 1 - \delta\log_2 e - \log_2\log_2 4(n+2) = n - \log\log n - O(1)$$

since $\ln(1-x) \leq -x$. Note that $d \geq 4$ implies that $n \geq d+1$.

THEOREM 2.12.2 For each $0 < \delta < 1$ a fraction of at least $1 - 2^{-\delta 2^n}$ of the Boolean functions
$f : \mathcal{B}^n \mapsto \mathcal{B}$ have depth complexity $D_{\Omega_0}(f)$ that satisfies the following bound when $n \geq 5$:

$$D_{\Omega_0}(f) \geq n - \log\log n - O(1)$$


As the above two theorems demonstrate, most Boolean functions on $n$ variables require
circuits whose size and depth are approximately $2^n/n$ and $n$, respectively. Fortunately, most
of the useful Boolean functions are far less complex than these bounds suggest. In fact, we
often encounter functions whose size is polynomial in $n$ and whose depth is logarithmic in, or
a small polynomial in the logarithm of, the size of their input. Functions that are polynomial in
the logarithm of $n$ are called poly-logarithmic.


2.13  Upper  Bounds  on  Circuit  Size 

In  this  section  we  demonstrate  that  every  Boolean  function  on  n  variables  can  be  realized  with 
circuit  size  and  depth  that  are  close  to  the  lower  bounds  derived  in  the  preceding  section. 
We  begin  by  stating  the  obvious  upper  bounds  on  size  and  depth  and  then  proceed  to  obtain 
stronger  (that  is,  smaller)  upper  bounds  on  size  through  the  use  of  refined  arguments. 
As shown in Section 2.2.2, every Boolean function $f : \mathcal{B}^n \mapsto \mathcal{B}$ can be realized as the OR
of its minterms. As shown in Section 2.5.4, the minterms on $n$ variables are produced by the
decoder function $f^{(n)}_{\text{decode}} : \mathcal{B}^n \mapsto \mathcal{B}^{2^n}$, which has a circuit with $2^n + (2n-2)2^{n/2}$ gates and
depth $\lceil\log_2 n\rceil + 1$. Consequently, we can realize $f$ from a circuit for $f^{(n)}_{\text{decode}}$ and an OR tree
on at most $2^n$ inputs (which has at most $2^n - 1$ two-input ORs and depth at most $n$). We
have that every function $f : \mathcal{B}^n \mapsto \mathcal{B}$ has circuit size and depth satisfying:

$$C_{\Omega}(f) \leq C_{\Omega}\left(f^{(n)}_{\text{decode}}\right) + 2^n - 1 \leq 2^{n+1} + (2n-2)2^{n/2} - 1$$

$$D_{\Omega}(f) \leq D_{\Omega}\left(f^{(n)}_{\text{decode}}\right) + n \leq n + \lceil\log_2 n\rceil + 1$$

Thus every Boolean function $f : \mathcal{B}^n \mapsto \mathcal{B}$ can be realized with an exponential number of
gates and depth $n + \lceil\log_2 n\rceil + 1$. Since the depth lower bound of $n - O(\log\log n)$ applies to
almost all Boolean functions on $n$ variables (see Section 2.12), this is a very good upper bound
on depth. We improve upon the circuit size bound after summarizing the depth bound.

THEOREM 2.13.1 The depth complexity of every Boolean function $f : \mathcal{B}^n \mapsto \mathcal{B}$ satisfies the
following bound:

$$D_{\Omega_0}(f) \leq n + \lceil\log_2 n\rceil + 1$$
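The construction behind this bound (a decoder producing all minterms, followed by an OR tree over the minterms on which the function is 1) can be mimicked in software. The sketch below does not build gates; it only checks that the OR of minterms reproduces the function and counts the two-input ORs an OR tree would need. All names are hypothetical, and `size_bound` assumes even $n$ so that $2^{n/2}$ is an integer.

```python
from itertools import product

def minterm(w):
    """Return the minterm function that has value 1 only on input w."""
    return lambda x: int(tuple(x) == tuple(w))

def or_of_minterms(f, n):
    """Realize f as the OR of the minterms of the inputs on which f is 1.

    Returns the realized function and the number of two-input ORs needed
    to combine the selected minterms."""
    ones = [w for w in product((0, 1), repeat=n) if f(w)]
    terms = [minterm(w) for w in ones]

    def g(x):
        return int(any(t(x) for t in terms))

    return g, max(len(ones) - 1, 0)

def size_bound(n):
    """The bound 2^(n+1) + (2n - 2) 2^(n/2) - 1, evaluated for even n."""
    return 2 ** (n + 1) + (2 * n - 2) * 2 ** (n // 2) - 1

def parity(x):
    return sum(x) % 2

g, ors = or_of_minterms(parity, 4)
assert all(g(x) == parity(x) for x in product((0, 1), repeat=4))
print(ors, size_bound(4))
```

For the parity function on four variables, half of the sixteen minterms are selected, so the OR tree is well within the $2^n - 1$ allowance.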


We now describe a procedure to construct circuits of small size for arbitrary Boolean functions
on $n$ variables. By the results of the preceding section, this size will be exponential in $n$.
The method of approach is to view an arbitrary Boolean function $f : \mathcal{B}^n \mapsto \mathcal{B}$ on $n$ input
variables $\mathbf{x}$ as a function of two sets of variables, $\mathbf{a}$, the first $k$ variables of $\mathbf{x}$, and $\mathbf{b}$, the remaining
$n - k$ variables of $\mathbf{x}$. That is, $\mathbf{x} = \mathbf{a}\mathbf{b}$ where $\mathbf{a} = (x_1, \ldots, x_k)$ and $\mathbf{b} = (x_{k+1}, \ldots, x_n)$.

As suggested by Fig. 2.22, we rearrange the entries in the defining table for $f$ into a rectangular
table with $2^k$ rows indexed by $\mathbf{a}$ and $2^{n-k}$ columns indexed by $\mathbf{b}$. The lower right-hand
quadrant of the table contains the values of the function $f$. The value of $f$ on $\mathbf{x}$ is the entry
at the intersection of the row indexed by the value of $\mathbf{a}$ and the column indexed by the value
of $\mathbf{b}$. We fix $s$ and divide the lower right-hand quadrant of the table into $p - 1$ groups of $s$
consecutive rows and one group of $s' \leq s$ consecutive rows where $p = \lceil 2^k/s\rceil$. (Note that
$(p-1)s + s' = 2^k$.) Call the $i$th collection of rows $A_i$. This table serves as the basis for the
$(k, s)$-Lupanov representation of $f$, from which a smaller circuit for $f$ can be constructed.

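Let $f_i : \mathcal{B}^n \mapsto \mathcal{B}$ be $f$ restricted to $A_i$, that is,

$$f_i(\mathbf{x}) = \begin{cases} f(\mathbf{x}) & \text{if } \mathbf{a} \in A_i \\ 0 & \text{otherwise.} \end{cases}$$

It follows that $f$ can be expanded as the OR of the $f_i$:

$$f(\mathbf{x}) = \bigvee_{i=1}^{p} f_i(\mathbf{x})$$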


Figure 2.22 The rectangular representation of the defining table of a Boolean function used in
its $(k, s)$-Lupanov representation.

We now expand $f_i$. When $\mathbf{b}$ is fixed, the values for $f_i(\mathbf{a}\mathbf{b})$ when $\mathbf{a} \in A_i$ constitute an
$s$-tuple ($s'$-tuple) $\mathbf{v}$ for $1 \leq i \leq p - 1$ (for $i = p$). Let $B_{i,\mathbf{v}}$ be those $(n-k)$-tuples $\mathbf{b}$ for
which $\mathbf{v}$ is the tuple of values of $f_i$ when $\mathbf{a} \in A_i$. (Note that the non-empty sets $B_{i,\mathbf{v}}$ for
different values of $\mathbf{v}$ are disjoint.) Let $f^{(c)}_{i,\mathbf{v}}(\mathbf{b}) : \mathcal{B}^{n-k} \mapsto \mathcal{B}$ be defined as

$$f^{(c)}_{i,\mathbf{v}}(\mathbf{b}) = \begin{cases} 1 & \text{if } \mathbf{b} \in B_{i,\mathbf{v}} \\ 0 & \text{otherwise.} \end{cases}$$

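Finally, we let $f^{(r)}_{i,\mathbf{v}}(\mathbf{a}) : \mathcal{B}^k \mapsto \mathcal{B}$ be the function that has value $v_j$, the $j$th component of $\mathbf{v}$,
when $\mathbf{a}$ is the $j$th $k$-tuple in $A_i$:

$$f^{(r)}_{i,\mathbf{v}}(\mathbf{a}) = \begin{cases} 1 & \text{if } \mathbf{a} \text{ is the } j\text{th element of } A_i \text{ and } v_j = 1 \\ 0 & \text{otherwise.} \end{cases}$$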

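It follows that $f_i(\mathbf{x}) = \bigvee_{\mathbf{v}} f^{(r)}_{i,\mathbf{v}}(\mathbf{a}) f^{(c)}_{i,\mathbf{v}}(\mathbf{b})$. Given these definitions, $f$ can be expanded in
the following $(k, s)$-Lupanov representation:

$$f(\mathbf{x}) = \bigvee_{i=1}^{p} \bigvee_{\mathbf{v}} f^{(r)}_{i,\mathbf{v}}(\mathbf{a}) \wedge f^{(c)}_{i,\mathbf{v}}(\mathbf{b}) \tag{2.19}$$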
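The decomposition can be exercised on a small truth table. In the sketch below, the rows of the rectangular table are split into groups of at most $s$ rows, the columns of each group are classified by the pattern of values they show on that group, and the function is then re-evaluated as an OR over groups and patterns of a row test ANDed with a column test. The function names, the sample function, and the parameter values are assumptions made for illustration.

```python
from itertools import product

def lupanov_parts(f, n, k, s):
    """Split the table of f into row groups A_i and, for each group, a map
    from each non-zero value pattern v to the set of columns showing it."""
    rows = list(product((0, 1), repeat=k))      # values of a
    cols = list(product((0, 1), repeat=n - k))  # values of b
    groups = [rows[j:j + s] for j in range(0, len(rows), s)]
    parts = []
    for A in groups:
        by_pattern = {}
        for b in cols:
            v = tuple(f(a + b) for a in A)  # values of f on group A, column b
            if any(v):  # the all-zero pattern contributes nothing to the OR
                by_pattern.setdefault(v, set()).add(b)
        parts.append((A, by_pattern))
    return parts

def lupanov_eval(parts, a, b):
    """OR over groups i and patterns v of (row test on a) AND (column test on b)."""
    for A, by_pattern in parts:
        for v, B in by_pattern.items():
            row = a in A and v[A.index(a)] == 1  # a is the jth row of A and v_j = 1
            col = b in B                         # column b shows pattern v on A
            if row and col:
                return 1
    return 0

def example_f(x):
    """Arbitrary sample function on 5 variables (hypothetical)."""
    i = int("".join(map(str, x)), 2)
    return int((37 * i * i + 11 * i) % 7 < 3)

n, k, s = 5, 3, 3
parts = lupanov_parts(example_f, n, k, s)
print(len(parts), [len(bp) for _, bp in parts])
```

With $k = 3$ and $s = 3$ there are eight rows, so the groups have sizes 3, 3, and 2, matching $p = \lceil 2^k/s \rceil = 3$.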

We now bound the number of logic elements needed to realize an arbitrary function $f : \mathcal{B}^n \mapsto \mathcal{B}$
in this representation.

Consider the functions $f^{(r)}_{i,\mathbf{v}}(\mathbf{a})$ for a fixed value of $\mathbf{v}$. We construct a decoder circuit for
the minterms in $\mathbf{a}$ that has size at most $2^k + (k-2)2^{k/2}$. Each of the functions $f^{(r)}_{i,\mathbf{v}}$ can be
realized as the OR of $s$ minterms in $\mathbf{a}$ for $1 \leq i \leq p - 1$ and $s'$ minterms otherwise. Thus,
$(p-1)(s-1) + (s'-1) \leq 2^k$ two-input ORs suffice for all values of $i$ and a fixed value of $\mathbf{v}$.
Hence, for each value of $\mathbf{v}$ the functions $f^{(r)}_{i,\mathbf{v}}$ can be realized by a circuit of size $O(2^k)$. Since
there are at most $2^s$ choices for $\mathbf{v}$, all $f^{(r)}_{i,\mathbf{v}}$ can be realized by a circuit of size $O(2^{k+s})$.

Consider next the functions $f^{(c)}_{i,\mathbf{v}}(\mathbf{b})$. We construct a decoder circuit for the minterms of
$\mathbf{b}$ that has size at most $2^{n-k} + (n-k-2)2^{(n-k)/2}$. Since for each $i$, $1 \leq i \leq p$, the sets
$B_{i,\mathbf{v}}$ for different values of $\mathbf{v}$ are disjoint, the functions $f^{(c)}_{i,\mathbf{v}}(\mathbf{b})$ for a given $i$ can be realized
as ORs of at most $2^{n-k}$ minterms in total using at most $2^{n-k}$ two-input ORs. Thus, all
$f^{(c)}_{i,\mathbf{v}}(\mathbf{b})$, $1 \leq i \leq p$, can be realized with $p2^{n-k} + 2^{n-k} + (n-k-2)2^{(n-k)/2}$ gates.

Consulting (2.19), we see that to realize $f$ we must add one AND gate for each $i$ and tuple
$\mathbf{v}$. We must also add the number of two-input OR gates needed to combine these products.
Since there are at most $p2^s$ products, at most $p2^s$ OR gates are needed for a total of $p2^{s+1}$
gates.

Let $C_{k,s}(f)$ be the total number of gates needed to realize $f$ in the $(k, s)$-Lupanov
representation. $C_{k,s}(f)$ satisfies the following inequality:

$$C_{k,s}(f) \leq O(2^{k+s}) + O(2^{n-k}) + p(2^{n-k} + 2^{s+1})$$

Since $p = \lceil 2^k/s\rceil$, $p \leq 2^k/s + 1$, this expands to

$$C_{k,s}(f) \leq O(2^{k+s}) + O(2^{n-k}) + \frac{2^n}{s} + \frac{2^{k+s+1}}{s}$$

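Now let $k = \lceil 3\log_2 n\rceil$ and $s = \lceil n - 5\log_2 n\rceil$. Then, $k + s \leq n - \log_2 n^2 + 2$ and
$n - k \leq n - \log_2 n^3$. As a consequence, for large $n$, we have

$$C_{k,s}(f) \leq O\left(\frac{2^n}{n^2}\right) + O\left(\frac{2^n}{n^3}\right) + \frac{2^n}{(n - 5\log_2 n)}$$

We summarize the result in a theorem.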


THEOREM 2.13.2 For each $\epsilon > 0$ there exists some $N_0 > 1$ such that for all $n \geq N_0$ every
Boolean function $f : \mathcal{B}^n \mapsto \mathcal{B}$ has a circuit size complexity satisfying the following upper bound:

$$C_{\Omega_0}(f) \leq \frac{2^n}{n}(1 + \epsilon)$$
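To see how quickly the gate count of the Lupanov construction approaches $2^n/n$, the individual terms of the size bound can be added up exactly for $k = \lceil 3\log_2 n\rceil$ and $s = \lceil n - 5\log_2 n\rceil$. The sketch below is an illustration under stated assumptions: it replaces each $O(\cdot)$ term by the explicit counts given in the derivation, rounds $2^{k/2}$ and $2^{(n-k)/2}$ up to integer powers of 2, and uses hypothetical function names. It requires $n$ large enough that $s$ is positive.

```python
import math
from fractions import Fraction

def lupanov_gate_count(n):
    """Upper bound on the gates used by the (k, s)-Lupanov circuit."""
    k = math.ceil(3 * math.log2(n))
    s = math.ceil(n - 5 * math.log2(n))
    if s < 1:
        raise ValueError("n is too small for this choice of s")
    p = -(-2 ** k // s)  # ceil(2^k / s)
    # Decoders for the minterms of a and of b (half-exponents rounded up).
    dec_a = 2 ** k + (k - 2) * 2 ** ((k + 1) // 2)
    dec_b = 2 ** (n - k) + (n - k - 2) * 2 ** ((n - k + 1) // 2)
    row_ors = 2 ** s * 2 ** k    # at most 2^k ORs for each of 2^s patterns
    col_ors = p * 2 ** (n - k)   # at most 2^(n-k) ORs for each row group
    combine = p * 2 ** (s + 1)   # ANDs plus ORs over at most p 2^s products
    return dec_a + dec_b + row_ors + col_ors + combine

def ratio(n):
    """Gate-count bound divided by 2^n / n, computed with exact integers."""
    return float(Fraction(lupanov_gate_count(n) * n, 2 ** n))

for n in (64, 1024, 4096):
    print(n, round(ratio(n), 4))
```

The ratio is dominated by $n/s = n/(n - 5\log_2 n)$, so it decreases toward 1 only slowly; this is the sense in which the theorem holds for all sufficiently large $n$.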

Since we show in Section 2.12 that for $0 < \epsilon < 1$ almost all Boolean functions $f : \mathcal{B}^n \mapsto \mathcal{B}$
have a circuit size complexity satisfying

$$C_{\Omega_0}(f) \geq \frac{2^n}{n}(1 - \epsilon) - 2n^2$$

for $n \geq 2[(1-\epsilon)/\epsilon]\log_2[(3e)^2(1-\epsilon/2)]$, this is a good lower bound.


Problems 

MATHEMATICAL  PRELIMINARIES 

2.1 Show that the following identities on geometric series hold:

$$\sum_{j=0}^{s} a^j = \frac{(a^{s+1} - 1)}{(a - 1)}$$

$$\sum_{j=0}^{s} j a^j = \frac{a}{(a-1)^2}\left(s a^{s+1} - (s+1)a^s + 1\right)$$

©John  E  Savage 




Figure 2.23 The natural logarithm of the factorial $n!$ is $\sum_{k=1}^{n} \ln k$, which is bounded below
by $\int_1^n \ln x \, dx$ and above by $\int_1^n \ln(x+1) \, dx$.


2.2 Derive tight upper and lower bounds on the factorial function $n! = n(n-1) \cdots 3 \cdot 2 \cdot 1$.
Hint: Derive bounds on $\ln n!$ where $\ln$ is the natural logarithm. Use the information
given in Fig. 2.23.

2.3 Let $T(d)$ be a complete balanced binary tree of depth $d$. $T(1)$, shown in Fig. 2.24(a),
has a root and two leaves. $T(d)$ is obtained by attaching to each of the leaves of $T(1)$
copies of $T(d-1)$. $T(3)$ is shown in Fig. 2.24(b).

a) Show by induction that $T(d)$ has $2^d$ leaves and $2^d - 1$ non-leaf vertices.

b) Show that any binary tree (each vertex except leaves has two descendants) with $n$
leaves has $n - 1$ non-leaf vertices and depth at least $\lceil\log_2 n\rceil$.


Figure 2.24 Complete balanced binary trees a) of depth 1 and b) of depth 3.
BINARY FUNCTIONS AND LOGIC CIRCUITS

2.4 a) Write a procedure EXOR in a language of your choice that writes the description
of the straight-line program given in equation (2.2).
b) Write a program in a language of your choice that evaluates an arbitrary straight-line
program given in the format of equation (2.2) in which each input value is
specified.

2.5 A set of Boolean functions forms a complete basis $\Omega$ if a logic circuit can be constructed
for every Boolean function $f : \mathcal{B}^n \mapsto \mathcal{B}$ using just functions in $\Omega$.

a)  Show  that  the  basis  consisting  of  one  function,  the  NAND  gate,  a  gate  on  two 
inputs  realizing  the  NOT  of  the  AND  of  its  inputs,  is  complete. 

b)  Determine  whether  or  not  the  basis  {AND,  OR}  is  complete. 

2.6 Show that the CNF of a Boolean function $f$ is unique and is the negation of the DNF
of $\overline{f}$.

2.7  Show  that  the  RSE  of  a  Boolean  function  is  unique. 

2.8 Show that any SOPE (POSE) of the parity function $f^{(n)}_{\oplus}$ has exponentially many terms.

Hint: Show by contradiction that every term in a SOPE (every clause of a POSE)
of $f^{(n)}_{\oplus}$ contains every variable. Then use the fact that the DNF (CNF) of $f^{(n)}_{\oplus}$ has
exponentially many terms to complete the proof.

2.9 Demonstrate that the RSE of the OR of $n$ variables, $f^{(n)}_{\vee}$, includes every product term
except for the constant 1.

2.10 Consider the Boolean function $f^{(n)}_{\bmod 3}$ on $n$ variables, which has value 1 when the sum
of its variables is zero modulo 3 and value 0 otherwise. Show that it has exponential-size
DNF, CNF, and RSE normal forms.

Hint:  Use  the  fact  that  the  following  sum  is  even: 


2.11 Show that every Boolean function $f^{(n)} : \mathcal{B}^n \mapsto \mathcal{B}$ can be expanded as follows:

$$f(x_1, x_2, \ldots, x_n) = x_1 f(1, x_2, \ldots, x_n) \vee \overline{x}_1 f(0, x_2, \ldots, x_n)$$

Apply this expansion to each variable of $f(x_1, x_2, x_3) = \overline{x}_1\overline{x}_2 \vee x_2 x_3$ to obtain its
DNF.

2.12 In a dual-rail logic circuit 0 and 1 are represented by the pairs (0, 1) and (1, 0),
respectively. A variable $x$ is represented by the pair $(x, \overline{x})$. A NOT in this representation
(called a DRL-NOT) is a pair of twisted wires.

a) How are AND (DRL-AND) and OR (DRL-OR) realized in this representation? Use
standard AND and OR gates to construct circuits for gates in the new representation.
Show that every function $f : \mathcal{B}^n \mapsto \mathcal{B}^m$ can be realized by a dual-rail logic
circuit in which the standard NOT gates are used only on input variables (to obtain
the pair $(x, \overline{x})$).
b) Show that the size and depth of a dual-rail logic circuit for a function $f : \mathcal{B}^n \mapsto \mathcal{B}$
are at most twice the circuit size (plus the NOTs for the inputs) and at most one
more than the circuit depth of $f$ over the basis {AND, OR, NOT}, respectively.

2.13 A function $f : \mathcal{B}^n \mapsto \mathcal{B}$ is monotone if for all $1 \leq j \leq n$, $f(x_1, \ldots, x_{j-1}, 0, x_{j+1},
\ldots, x_n) \leq f(x_1, \ldots, x_{j-1}, 1, x_{j+1}, \ldots, x_n)$ for all values of the remaining variables;
that is, increasing any variable from 0 to 1 does not cause the function to decrease its
value from 1 to 0.

a) Show that every circuit over the basis $\Omega_{\text{mon}}$ = {AND, OR} computes monotone
functions at every gate.

b) Show that every monotone function $f^{(n)} : \mathcal{B}^n \mapsto \mathcal{B}$ can be expanded as follows:

$$f(x_1, x_2, \ldots, x_n) = x_1 f(1, x_2, \ldots, x_n) \vee f(0, x_2, \ldots, x_n)$$

Show that this implies that every monotone function can be realized by a logic circuit
over the monotone basis $\Omega_{\text{mon}}$ = {AND, OR}.

SPECIALIZED  FUNCTIONS 

2.14 Complete the proof of Lemma 2.5.3 by solving the recurrences stated in Equation (2.4).

2.15  Design  a  multiplexer  circuit  of  circuit  size  2n+1  plus  lower-order  terms  when  n  is  even. 
Hint:  Construct  a  smaller  circuit  by  applying  the  decomposition  given  in  Section  2.5.4 
of  the  minterms  of  n  variables  into  minterms  on  the  two  halves  of  the  n  variables. 

2.16 Complete the proof of Lemma 2.11.1 by establishing the correctness of the inductive
hypothesis stated in its proof.

2.17 The binary sorting function is defined in Section 2.11. Show that it can be realized
with a circuit whose size is $O(n)$ and depth is $O(\log n)$.

Hint:  Consider  using  a  circuit  for  ,  a  decoder  circuit  and  other  circuitry.  Is  there 
a  role  for  a  prefix  computation  in  this  problem? 

LOGICAL  FUNCTIONS 

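2.18 Let $f^{(n)}_{\text{member}} : \mathcal{B}^{(n+1)b} \mapsto \mathcal{B}$ be defined below.

$$f^{(n)}_{\text{member}}(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n, \mathbf{y}) = \begin{cases} 1 & \mathbf{x}_i = \mathbf{y} \text{ for some } 1 \leq i \leq n \\ 0 & \text{otherwise} \end{cases}$$

where $\mathbf{x}_i, \mathbf{y} \in \mathcal{B}^b$ and $\mathbf{x}_i = \mathbf{y}$ if and only if they agree in each position.

Obtain good upper bounds to $C_{\Omega}(f^{(n)}_{\text{member}})$ and $D_{\Omega}(f^{(n)}_{\text{member}})$ by constructing a
circuit over the basis $\Omega = \{\wedge, \vee, \oplus\}$.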

2.19 Design a circuit to compare two $n$-bit binary numbers and return the value 1 if the first
is larger than or equal to the second and 0 otherwise.

Hint: Compare each pair of digits of the same significance and generate three outcomes,
yes, maybe, and no, corresponding to whether the first digit is greater than,
equal to or less than the second. How can you combine the outputs of such a comparison
circuit to design a circuit for the problem? Does a prefix computation appear in
your circuit?
2.20 a) Let $\odot_{\text{copy}} : S^2 \mapsto S$ be the operation

$$a \odot_{\text{copy}} b = a$$

Show that $(S, \odot_{\text{copy}})$ is a semigroup for $S$ an arbitrary non-empty set.
b) Let $\cdot$ denote string concatenation over the set $\{0, 1\}^*$ of binary strings. Show that
it is associative.

2.21 The segmented prefix computation with the associative operation $\odot$ on a "value"
$n$-vector $\mathbf{x}$ over a set $S$, given a "flag vector" $\boldsymbol{\phi}$ over $\mathcal{B}$, is defined as follows: the value
of the $i$th entry $y_i$ of the "result vector" $\mathbf{y}$ is $x_i$ if its flag is 1 and otherwise is the
associative combination with $\odot$ of $x_i$ and the entries to its left up to and including the
first occurrence of a 1 in the flag array. The leftmost bit in every flag vector is 1. An
example of a segmented prefix computation is given in Section 2.6.

Assuming that $(S, \odot)$ is a semigroup, a segmented prefix computation over the set
$S \times \mathcal{B}$ of pairs is a special case of general prefix computation. Consider the operator $\otimes$
on pairs $(x_i, \phi_i)$ of values and flags defined below:

$$((x_1, \phi_1) \otimes (x_2, \phi_2)) = \begin{cases} (x_2, 1) & \phi_2 = 1 \\ (x_1 \odot x_2, \phi_1) & \phi_2 = 0 \end{cases}$$

Show that $((S, \mathcal{B}), \otimes)$ is a semigroup by proving that $(S, \mathcal{B})$ is closed under the
operator $\otimes$ and that the operator $\otimes$ is associative.

2.22 Construct a logic circuit of size $O(n \log n)$ and depth $O(\log^2 n)$ that, given a binary
$n$-tuple $\mathbf{x}$, computes the $n$-tuple $\mathbf{y}$ containing the running sum of the number of 1's in $\mathbf{x}$.

2.23 Given $2n$ Boolean variables organized as pairs $0a$ or $1a$, design a circuit that moves pairs
of the form $1a$ to the left and the others to the right without changing their relative
order. Show that the circuit has size $O(n \log^2 n)$.

2.24  Linear  recurrences  play  an  important  role  in  many  problems  including  the  solution 
of  a  tridiagonal  linear  system  of  equations.  They  are  defined  over  “near-rings,”  which 
are  slightly  weaker  than  rings  in  not  requiring  inverses  under  the  addition  operation. 
(Rings  are  defined  in  Section  6.2.1.) 

A near-ring $(\mathcal{R}, \cdot, +)$ is a set $\mathcal{R}$ together with an associative multiplication operator $\cdot$
and an associative and commutative addition operator $+$. (If $+$ is commutative, then
for all $a, b \in \mathcal{R}$, $a + b = b + a$.) In addition, $\cdot$ distributes over $+$; that is, for all
$a, b, c \in \mathcal{R}$, $a \cdot (b + c) = a \cdot b + a \cdot c$.

A first-order linear recurrence of length $n$ is an $n$-tuple $\mathbf{x} = (x_1, x_2, \ldots, x_n)$ of
variables over a near-ring $(\mathcal{R}, \cdot, +)$ that satisfies $x_1 = b_1$ and the following set of identities
for $2 \leq j \leq n$ defined in terms of elements $\{a_j, b_j \in \mathcal{R} \mid 2 \leq j \leq n\}$:

$$x_j = a_j \cdot x_{j-1} + b_j$$

Use the ideas of Section 2.7 on carry-lookahead addition to show that $x_j$ can be written

$$x_j = c_j \cdot x_1 + d_j$$

where the pairs $(c_j, d_j)$ are the result of a prefix computation.
2.25 Design a circuit that finds the most significant non-zero position in an $n$-bit binary
number and logically shifts the binary number left so that the non-zero bit is in the most
significant position. The circuit should produce not only the shifted binary number but
also a binary representation of the amount of the shift.

2.26 Consider the function $\pi[j, k] = \pi[j, k-1] \circ \pi[k, k]$ for $1 \leq j < k \leq n - 1$, where $\circ$
is defined in Section 2.7.1. Show by induction that the first component of $\pi[j, k]$ is 1
if and only if a carry propagates through the full adder stages numbered $j, j+1, \ldots, k$
and its second component is 1 if and only if a carry is generated at one of these stages,
propagates through subsequent stages, and appears as a carry out of the $k$th stage.

2.27 Give a construction of a circuit for subtracting one $n$-bit positive binary integer from
another using the two's-complement operation. Show that the circuit has size $O(n)$
and depth $O(\log n)$.

2.28  Complete  the  proof  of  Theorem  2.9.3  outlined  in  the  text.  In  particular,  solve  the 
recurrence  given  in  equation  (2.10). 

2.29 Show that the depth bound stated in Theorem 2.9.3 can be improved from $O(\log^2 n)$
to $O(\log n)$ without affecting the size bound by using carry-save addition to form the
six additions (or subtractions) that are involved at each stage.

Hint: Observe that each multiplication of $(n/2)$-bit numbers at the top level is expanded
at the next level as sums of the product of $(n/4)$-bit numbers and that this type
of replacement continues until the product is formed of 1-bit numbers. Observe also
that $2n$-bit carry-save adders can be used at the top level but that the smaller carry-save
adders can be used at successively lower levels.

2.30 Residue arithmetic can be used to add and subtract integers. Given positive relatively
prime integers $p_1, p_2, \ldots, p_k$ (no common factors), an integer $n$ in the set $\{0, 1, 2, \ldots,
N-1\}$, $N = p_1 p_2 \cdots p_k$, can be represented by the $k$-tuple $\mathbf{n} = (n_1, n_2, \ldots, n_k)$,
where $n_j = n \bmod p_j$. Let $n$ and $m$ be in this set.

a) Show that if $n \neq m$, $\mathbf{n} \neq \mathbf{m}$.

b) Form $\mathbf{n} + \mathbf{m}$ by adding corresponding $j$th components modulo $p_j$. Show that
$\mathbf{n} + \mathbf{m}$ uniquely represents $(n + m) \bmod N$.

c) Form $\mathbf{n} \times \mathbf{m}$ by multiplying corresponding $j$th components of $\mathbf{n}$ and $\mathbf{m}$ modulo
$p_j$. Show that $\mathbf{n} \times \mathbf{m}$ is the unique representation for $(nm) \bmod N$.

2.31  Use  the  circuit  designed  in  Problem  2.19  to  build  a  circuit  that  adds  two  n-bit  binary 
numbers  modulo  an  arbitrary  third  n-bit  binary  number.  You  may  use  known  circuits. 

2.32 In prime factorization an integer $n$ is represented as the product of primes. Let $p(N)$
be the largest prime less than $N$. Then, $n \in \{2, \ldots, N-1\}$ is represented by the
exponents $(e_2, e_3, \ldots, e_{p(N)})$, where $n = 2^{e_2} 3^{e_3} \cdots p(N)^{e_{p(N)}}$. The representation
for the product of two integers in this system is the sum of the exponents of their
respective prime factors. Show that this leads to a multiplication circuit whose depth
is proportional to $\log\log\log N$. Determine the size of the circuit using the fact that
there are $O(N/\log N)$ primes in the set $\{2, \ldots, N-1\}$.
2.33 Construct a circuit for the division of two $n$-bit binary numbers from circuits for the
reciprocal function and the integer multiplication function. Determine
the size and depth of this circuit and the accuracy of the result.

2.34 Let $f : \mathcal{B}^n \mapsto \mathcal{B}^{kn}$ be an integer power of $x$; that is, $f(x) = x^k$ for some integer $k$.
Show that such functions contain the shifting function $f^{(m)}_{\text{shift}}$ as a subfunction for some
integer $m$. Determine $m$ dependent on $n$ and $k$.

2.35 Let $f : \mathcal{B}^n \mapsto \mathcal{B}^n$ be a fractional power of $x$ of the form $f(x) = \lceil x^{q/2^k}\rceil$, $0 <
q < 2^k \leq \log_2 n$. Show that this function contains the shifting function $f^{(m)}_{\text{shift}}$ as a
subfunction. Find the largest value of $m$ for which this holds.


Chapter  Notes 

Logic circuits have a long history. Early in the nineteenth century Babbage designed
mechanical computers capable of logic operations. In the twentieth century logic circuits, called
switching circuits, were constructed of electromechanical relays. The earliest formal analysis of
logic circuits is attributed to Claude Shannon [306]; he applied Boolean algebra to the analysis
of logic circuits, the topic of Section 2.2. Reduction between problems, a technique central
to computer science, is encountered whenever one uses an existing program to solve a new
problem by pre-processing inputs and post-processing outputs. Reductions also provide a way
to identify problems with similar complexity, an idea given great importance by the work of
Cook [74], Karp [159], and Levin [199] on NP-completeness. (See also [335].) This topic is
explored in depth in Chapter 8.

The upper bound on the size of the ripple adder described in Section 2.7 cannot be improved,
as shown by Red'kin [276] using the gate elimination method of Section 9.3.2. Prefix
computations, the subject of Section 2.6, were first used by Ofman [234]. He constructed the adder
based on carry-lookahead addition described in Section 2.7. Krapchenko [173] and Brent
[57] developed adders with linear size whose depth is $\lceil\log n\rceil + O(\sqrt{\lceil\log n\rceil})$, asymptotically
almost as good as the best possible depth bound of $\lceil\log n\rceil$.

Ofman used carry-save addition for fast integer multiplication [234]. Wallace independently
discovered carry-save addition and logarithmic depth circuits for addition and multiplication
[356]. The divide-and-conquer integer multiplication algorithm of Section 2.9.2 is due
to Karatsuba [155]. As mentioned at the end of Section 2.9, Schönhage and Strassen [303]
have designed binary integer multipliers of depth $O(\log n)$ whose size is $O(n \log n \log\log n)$.

Sir  Isaac  Newton  around  1665  invented  the  iterative  method  bearing  his  name  used  in 
Section  2.10  for  binary  integer  division.  Our  treatment  of  this  idea  follows  that  given  by  Tate 
[325].  Reif  and  Tate  [278]  have  shown  that  binary  integer  division  can  be  done  with  circuit 
size $O(n \log n \log\log n)$ and depth $O(\log n \log\log n)$ using circuits whose description is log-space
uniform. Beame, Cook, and Hoover [33] have given an $O(\log n)$-depth circuit for the
reciprocal function, the best possible depth bound up to a constant multiple, but one whose
size is polynomial in $n$ and whose description is not uniform; it requires knowledge of about
$n^2/\log n$ primes.

The  key  result  in  Section  2.11  on  symmetric  functions  is  due  to  Muller  and  Preparata 
[226]. As indicated, it is the basis for showing that every one-output symmetric function can
be  realized  by  a  circuit  whose  size  and  depth  are  linear  and  logarithmic,  respectively. 

Shannon [307] developed lower bounds for two-terminal switching circuits of the type
given in Section 2.12 on circuit size. Muller [224] extended the techniques of Shannon to
derive the lower bounds on circuit size given in Theorem 2.12.1. Shannon and Riordan [281]
developed a lower bound of $\Omega(2^n/\log n)$ on the size of Boolean formulas, circuits in which the
fan-out of each gate is 1. As seen in Chapter 9, such bounds readily translate into lower bounds
on depth of the form given in Theorem 2.12.2. Gaskov, using the Lupanov representation, has
derived a comparable upper bound [110].

The upper bound on circuit size given in Section 2.13 is due to Lupanov [208]. Shannon
and Riordan [281] show that a lower bound of $\Omega(2^n/\log n)$ must apply to the formula size
(see Definition 9.1.1) of most Boolean functions on $n$ variables. Given the relationship of
Theorem 9.2.2 between formula size and depth, a depth lower bound of $n - \log\log n - O(1)$
follows.

Early  work  on  circuits  and  circuit  complexity  is  surveyed  by  Paterson  [237]  and  covered  in 
depth by Savage [287]. More recent coverage of this subject is contained in the survey article
by Boppana and Sipser [50] and books by Wegener [360] and Dunne [92].

Machines with Memory


As  we  saw  in  Chapter  1,  every  finite  computational  task  can  be  realized  by  a  combinational 
circuit.  While  this  is  an  important  concept,  it  is  not  very  practical;  we  cannot  afford  to  design 
a  special  circuit  for  each  computational  task.  Instead  we  generally  perform  computational  tasks 
with  machines  having  memory.  In  a  strong  sense  to  be  explored  in  this  chapter,  the  memory  of 
such  machines  allows  them  to  reuse  their  equivalent  circuits  to  realize  functions  of  high  circuit 
complexity. 

In  this  chapter  we  examine  the  deterministic  and  nondeterministic  finite-state  machine 
(FSM),  the  random-access  machine  (RAM),  and  the  Turing  machine.  The  finite-state  machine 
moves  from  state  to  state  while  reading  input  and  producing  output.  The  RAM  has  a  central 
processing  unit  (CPU)  and  a  random-access  memory  with  the  property  that  each  memory 
word  can  be  accessed  in  one  unit  of  time.  Its  CPU  executes  instructions,  reading  and  writing 
data  from  and  to  the  memory.  The  Turing  machine  has  a  control  unit  that  is  a  finite-state 
machine  and  a  tape  unit  with  a  head  that  moves  from  one  tape  cell  to  a  neighboring  one  in 
each  unit  of  time.  The  control  unit  reads  from,  writes  to,  and  moves  the  head  of  the  tape  unit. 

We  demonstrate  through  simulation  that  the  RAM  and  the  Turing  machine  are  universal 
in  the  sense  that  every  finite-state  machine  can  be  simulated  by  the  RAM  and  that  it  and  the 
Turing  machine  can  simulate  each  other.  Since  they  are  equally  powerful,  either  can  be  used  as 
a  reference  model  of  computation. 

We  also  simulate  with  circuits  computations  performed  by  the  FSM,  RAM,  and  Turing 
machine.  These  circuit  simulations  establish  two  important  results.  First,  they  show  that  all 
computations  are  constrained  by  the  available  resources,  such  as  space  and  time.  For  example, 
if a function $f$ is computed in $T$ steps by the RAM with storage capacity $S$ (in bits), then $S$
and $T$ must satisfy the inequality $C_\Omega(f) = O(ST)$, where $C_\Omega(f)$ is the size of the smallest
circuit for $f$ over the complete basis $\Omega$. Any attempt to compute $f$ on the RAM using space
$S$ and time $T$ whose product is too small will fail. Second, an $O(\log ST)$-space, $O(ST)$-time
program  exists  to  write  the  descriptions  of  circuits  simulating  the  above  machines.  This  fact 
leads  to  the  identification  in  this  chapter  of  the  first  examples  of  P-complete  and  NP-complete 
problems. 

3.1 Finite-State Machines

The  finite-state  machine  (FSM)  has  a  set  of  states,  one  of  which  is  its  initial  state.  At  each  unit 
of  time  an  FSM  is  given  a  letter  from  its  input  alphabet.  This  causes  the  machine  to  move 
from  its  current  state  to  a  potentially  new  state.  While  in  a  state,  the  FSM  produces  a  letter 
from  its  output  alphabet.  Such  a  machine  computes  the  function  defined  by  the  mapping 
from  its  initial  state  and  strings  of  input  letters  to  strings  of  output  letters.  FSMs  can  also  be 
used  to  accept  strings,  as  discussed  in  Chapter  4.  Some  states  are  called  final  states.  A  string 
is  recognized  (or  accepted)  by  an  FSM  if  the  last  state  entered  by  the  machine  on  that  input 
string  is  a  final  state.  The  language  recognized  (or  accepted)  by  an  FSM  is  the  set  of  strings 
accepted  by  it.  We  now  give  a  formal  definition  of  an  FSM. 

DEFINITION 3.1.1 A finite-state machine (FSM) $M$ is a seven-tuple $M = (\Sigma, \Psi, Q, \delta, \lambda, s, F)$,
where $\Sigma$ is the input alphabet, $\Psi$ is the output alphabet, $Q$ is the finite set of states,
$\delta : Q \times \Sigma \mapsto Q$ is the next-state function, $\lambda : Q \mapsto \Psi$ is the output function, $s$ is the initial
state (which may be fixed or variable), and $F$ is the set of final states ($F \subseteq Q$). If the FSM is
given input letter $a$ when in state $q$, it enters state $\delta(q, a)$. While in state $q$ it produces the output
letter $\lambda(q)$.

The FSM $M$ accepts the string $w \in \Sigma^*$ if the last state entered by $M$ on the input string $w$
starting in state $s$ is in the set $F$. $M$ recognizes (or accepts) the language $L$ consisting of the set
of such strings.

When the initial state of the FSM $M$ is not fixed, for each integer $T$, $M$ maps the initial state
$s$ and its $T$ external inputs $w_1, w_2, \ldots, w_T$ onto its $T$ external outputs $y_1, y_2, \ldots, y_T$ and the
final state $q^{(T)}$. We say that in $T$ steps the FSM $M$ computes the function $f_M^{(T)} : Q \times \Sigma^T \mapsto Q \times \Psi^T$.
It is assumed that the sets $\Sigma$, $\Psi$, and $Q$ are encoded in binary so that $f_M^{(T)}$ is a binary
function.

The next-state and output functions of an FSM, $\delta$ and $\lambda$, can be represented as in Fig. 3.1.
We visualize these functions taking a state value from a memory and an input value from an
external input and producing next-state and output values. Next-state values are stored in the
memory and output values are released to the external world. From this representation an
actual machine (a sequential circuit) can be constructed (see Section 3.3). Once circuits are
constructed for $\delta$ and $\lambda$, we need only add memory units and a clock to construct a sequential
circuit that emulates an FSM.


Figure 3.1 The finite-state machine model.

Figure 3.2 A finite-state machine computing the EXCLUSIVE OR of its inputs.


An example of an FSM is shown in Fig. 3.2. Its input and output alphabets and state
set are $\Sigma = \{0, 1\}$, $\Psi = \{0, 1\}$, and $Q = \{q_0, q_1\}$, respectively. Its next-state and output
functions, $\delta$ and $\lambda$, are given below.

$\delta(q_0, 0) = q_0$, $\delta(q_0, 1) = q_1$, $\delta(q_1, 0) = q_1$, $\delta(q_1, 1) = q_0$
$\lambda(q_0) = 0$, $\lambda(q_1) = 1$

The FSM has initial state $q_0$ and final state $q_1$. As a convenience we explicitly identify final
states by shading, although in practice they can be associated with states producing a particular
output letter.

Each state has a label $q_j / v_j$, where $q_j$ is the name of the state and $v_j$ is the output produced
while in this state. The initial state has an arrow labeled with the word “start” pointing to
it. Clearly, the strings accepted by this FSM are those containing an odd number of
instances of 1. Thus it computes the EXCLUSIVE OR function on an arbitrary number of
inputs.
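
The behavior of this machine can be sketched in a few lines of Python. This is an illustration only: the names `DELTA`, `LAMBDA`, `run`, and `accepts` are our own, and the transition table is reconstructed from the description above (two states, with the output equal to the parity of the 1s read so far) rather than copied from the figure.

```python
# Moore machine for the EXCLUSIVE OR example (table reconstructed, not from the figure).
DELTA = {("q0", 0): "q0", ("q0", 1): "q1",
         ("q1", 0): "q1", ("q1", 1): "q0"}   # next-state function
LAMBDA = {"q0": 0, "q1": 1}                  # output function
START, FINAL = "q0", {"q1"}


def run(inputs, state=START):
    """Return (final_state, outputs); one output letter per state entered."""
    outputs = []
    for a in inputs:
        state = DELTA[(state, a)]
        outputs.append(LAMBDA[state])
    return state, outputs


def accepts(inputs):
    """A string is accepted if the last state entered is a final state."""
    final_state, _ = run(inputs)
    return final_state in FINAL
```

Under these assumptions the last output letter is the EXCLUSIVE OR of all inputs read, and `accepts` returns true exactly for strings with an odd number of 1s.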

While it is conventional to think of the finite-state machine as a severely restricted computational
model, it is actually a very powerful one. The random-access machine (RAM)
described  in  Section  3.4  is  an  FSM  when  the  number  of  memory  locations  that  it  contains 
is  bounded,  as  is  always  so  in  practice.  When  a  program  is  first  placed  in  the  memory  of 
the  RAM,  the  program  sets  the  initial  state  of  the  RAM.  The  RAM,  which  may  or  may  not 
read  external  inputs  or  produce  external  outputs,  generally  will  leave  its  result  in  its  memory; 
that  is,  the  result  of  the  computation  often  determines  the  final  state  of  the  random-access 
machine. 

The  FSM  defined  above  is  called  a  Moore  machine  because  it  was  defined  by  E.F.  Moore 
[223]  in  1956.  An  alternative  FSM,  the  Mealy  machine  (defined  by  Mealy  [215]  in  1955), 
has an output function $\lambda^* : Q \times \Sigma \mapsto \Psi$ that generates an output on each transition from
one  state  to  another.  This  output  is  determined  by  both  the  state  in  which  the  machine  resides 
before  the  state  transition  and  the  input  letter  causing  the  transition.  It  can  be  shown  that  the 
two  machine  models  are  equivalent  (see  Problem  3.6):  any  computation  one  can  do,  the  other 
can  do  also. 
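
One direction of this equivalence is easy to sketch. Assuming the convention used in this chapter, that a Moore machine emits the output letter of the state it has just entered, a Mealy machine with the same states reproduces its outputs by attaching to each transition the output of the destination state. The function names and the example tables below are our own, and the converse direction (the subject of Problem 3.6) is not shown.

```python
def moore_to_mealy(delta, lam):
    """Mealy output on transition (q, a) = Moore output of the state entered."""
    return {(q, a): lam[nxt] for (q, a), nxt in delta.items()}


def run_moore(delta, lam, state, inputs):
    """Outputs of a Moore machine: one letter per state entered."""
    out = []
    for a in inputs:
        state = delta[(state, a)]
        out.append(lam[state])
    return out


def run_mealy(delta, lam_star, state, inputs):
    """Outputs of a Mealy machine: one letter per transition taken."""
    out = []
    for a in inputs:
        out.append(lam_star[(state, a)])
        state = delta[(state, a)]
    return out


# Example tables: a two-state parity machine (hypothetical encoding).
XOR_DELTA = {("q0", 0): "q0", ("q0", 1): "q1",
             ("q1", 0): "q1", ("q1", 1): "q0"}
XOR_LAMBDA = {"q0": 0, "q1": 1}
```

With these tables, `run_moore` and `run_mealy` (the latter using the table returned by `moore_to_mealy`) produce the same output string on every input string.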

3.1.1 Functions Computed by FSMs

We  now  examine  the  ways  in  which  an  FSM  might  compute  a  function.  Since  our  goal  is  to 
understand  the  power  and  limits  of  computation,  we  must  be  careful  not  to  assume  that  an 
FSM  can  have  hidden  access  to  an  external  computing  device.  All  computing  devices  must 
be  explicit.  It  follows  that  we  allow  FSMs  only  to  compute  functions  that  receive  inputs  and 
produce  outputs  at  data-independent  times. 

To understand the function computed by an FSM $M$, observe that in initial state $q^{(0)} = s$
and receiving input letter $w_1$, $M$ enters state $q^{(1)} = \delta(q^{(0)}, w_1)$ and produces output
$y_1 = \lambda(q^{(1)})$. If $M$ then receives input $w_2$, it enters state $q^{(2)} = \delta(q^{(1)}, w_2)$ and produces output
$y_2 = \lambda(q^{(2)})$. Repeated applications of the functions $\delta$ and $\lambda$ on successive states with successive
inputs, as suggested by Fig. 3.3, generate the outputs $y_1, y_2, \ldots, y_T$ and the final state
$q^{(T)}$. The function $f_M^{(T)} : Q \times \Sigma^T \mapsto Q \times \Psi^T$ given in Definition 3.1.1 defines this mapping
from an initial state and inputs to the final state and outputs:

$$f_M^{(T)}\left(q^{(0)}, w_1, w_2, \ldots, w_T\right) = \left(q^{(T)}, y_1, y_2, \ldots, y_T\right)$$

This  simulation  of  a  machine  with  memory  by  a  circuit  illustrates  a  fundamental  point  about 
computation,  namely,  that  the  role  of  memory  is  to  hold  intermediate  results  on  which  the 
logical  circuitry  of  the  machine  can  operate  in  successive  cycles. 

When an FSM $M$ is used in a $T$-step computation, it usually does not compute the most
general function $f_M^{(T)}$ that it can. Instead, some restrictions are generally placed on the possible
initial  states,  on  the  values  of  the  external  inputs  provided  to  M,  and  on  the  components  of 
the  final  state  and  output  letters  used  in  the  computation.  Consider  three  examples  of  the 
specialization  of  an  FSM  to  a  particular  task.  In  the  first,  let  the  FSM  model  be  that  shown  in 
Fig.  3.2  and  let  it  be  used  to  form  the  EXCLUSIVE  OR  of  n  variables.  In  this  case,  we  supply  n 
bits  to  the  FSM  but  ignore  all  but  the  last  output  value  it  produces.  In  the  second  example,  let 
the  FSM  be  a  programmable  machine  in  which  a  program  is  loaded  into  its  memory  before  the 
start  of  a  computation,  thereby  setting  its  initial  state.  The  program  ignores  all  external  inputs 
and  produces  no  output,  leaving  the  value  of  the  function  in  memory.  In  the  third  example, 
again let the FSM be programmable, but let the program initially residing in its
memory  be  a  “boot  program”  that  treats  its  inputs  as  program  statements.  (Thus,  the  FSM 
has  a  fixed  initial  state.)  The  boot  program  forms  a  program  by  loading  these  statements  into 
successive  memory  locations.  It  then  jumps  to  the  first  location  in  this  program. 

Figure 3.3 A circuit computing the same function, $f_M^{(T)}$, as a finite-state machine $M$ in $T$
steps.

In each of these examples, the function $f$ that is actually computed by $M$ in $T$ steps is
a subfunction of the function $f_M^{(T)}$ because $f$ is obtained by either restricting the values of
the initial state and inputs to $M$ or deleting outputs or both. We assume that every function
computed by $M$ in $T$ steps is a subfunction $f$ of the function $f_M^{(T)}$.
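
The first example can be made concrete with a short Python sketch. It is only an illustration under our own naming: `f_M_T` stands for the most general $T$-step function, and `parity` is the subfunction obtained by fixing the initial state and deleting all outputs but the last. The two-state table is a reconstruction of the EXCLUSIVE OR machine, not a copy of the figure.

```python
def f_M_T(delta, lam, q0, inputs):
    """Most general T-step function: (initial state, inputs) -> (final state, outputs)."""
    state, outputs = q0, []
    for w in inputs:
        state = delta[(state, w)]
        outputs.append(lam[state])
    return state, tuple(outputs)


# EXCLUSIVE OR machine (reconstructed table, hypothetical state names).
DELTA = {("q0", 0): "q0", ("q0", 1): "q1",
         ("q1", 0): "q1", ("q1", 1): "q0"}
LAMBDA = {"q0": 0, "q1": 1}


def parity(bits):
    """Subfunction of f_M_T: initial state fixed to q0, only the last output kept.

    Requires at least one input bit.
    """
    _, outputs = f_M_T(DELTA, LAMBDA, "q0", bits)
    return outputs[-1]
```

Calling `f_M_T` with a different initial state, or keeping other outputs, yields other subfunctions of the same general function.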

The simple construction of Fig. 3.3 is the first step in deriving a space-time product inequality
for the random-access machine in Section 3.5 and in establishing a connection between
Turing time and circuit complexity in Section 3.9.2. It is also involved in the definition
of the P-complete and NP-complete problems in Section 3.9.4.

3.1.2  Computational  Inequalities  for  the  FSM 

In this book we model each computational task by a function that, we assume without loss
of generality, is binary. We also assume that the function $f_M^{(T)} : Q \times \Sigma^T \mapsto Q \times \Psi^T$
computed in $T$ steps by an FSM $M$ is binary. In particular, we assume that the next-state
and output functions, $\delta$ and $\lambda$, are also binary; that is, we assume that their input, state, and
output alphabets are encoded in binary. We now derive some consequences of the fact that a
computation by an FSM can be simulated by a circuit.

The size $C_\Omega(f_M^{(T)})$ of the smallest circuit to compute the function $f_M^{(T)}$ is no larger than
the size of the circuit shown in Fig. 3.3. But this circuit has size $T \cdot C_\Omega(\delta, \lambda)$, where $C_\Omega(\delta, \lambda)$
is the size of the smallest circuit to compute the functions $\delta$ and $\lambda$. The depth of the shallowest
circuit for $f_M^{(T)}$ is no more than $T \cdot D_\Omega(\delta, \lambda)$ because the longest path through the circuit of
Fig. 3.3 has this length.

Let $f$ be the function computed by $M$ in $T$ steps. Since it is a subfunction of $f_M^{(T)}$,
it follows from Lemma 2.4.1 that the size of the smallest circuit for $f$ is no larger than the
size of the circuit for $f_M^{(T)}$. Similarly, the depth of $f$, $D_\Omega(f)$, is no more than that of $f_M^{(T)}$.
Combining the observations of this paragraph with those of the preceding paragraph yields the
following computational inequalities. A computational inequality is an inequality relating
parameters of computation, such as time and the circuit size and depth of the next-state and
output function, to the size or depth of the smallest circuit for the function being computed.

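THEOREM 3.1.1 Let $f_M^{(T)}$ be the function computed by the FSM $M = (\Sigma, \Psi, Q, \delta, \lambda, s, F)$ in
$T$ steps, where $\delta$ and $\lambda$ are the binary next-state and output functions of $M$. The circuit size and
depth over the basis $\Omega$ of any function $f$ computed by $M$ in $T$ steps satisfy the following inequalities:

$$C_\Omega(f) \le C_\Omega\left(f_M^{(T)}\right) \le T\, C_\Omega(\delta, \lambda)$$

$$D_\Omega(f) \le D_\Omega\left(f_M^{(T)}\right) \le T\, D_\Omega(\delta, \lambda)$$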

The circuit size $C_\Omega(\delta, \lambda)$ and depth $D_\Omega(\delta, \lambda)$ of the next-state and output functions of an
FSM $M$ are measures of its complexity, that is, of how useful they are in computing functions.
The above theorem, which says nothing about the actual technologies used to realize $M$, relates
these two measures of the complexity of $M$ to the complexities of the function $f$ being
computed. This is a theorem about computational complexity, not technology.
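
As a small worked instance of Theorem 3.1.1 (an illustration under stated assumptions, not an example taken from the text), consider the EXCLUSIVE OR machine of Fig. 3.2 used for $T = n$ steps to compute the parity $f$ of $n$ bits. Assume that all gates have fan-in at most two, that the basis $\Omega$ contains the two-input $\oplus$ gate, and that each state is encoded by the bit it outputs. Then a single gate computes both the next state and the output, and the inequalities read:

```latex
\begin{gather*}
C_\Omega(\delta, \lambda) = 1, \qquad D_\Omega(\delta, \lambda) = 1, \\
n - 1 = C_\Omega(f) \le C_\Omega\bigl(f_M^{(n)}\bigr) \le n \cdot 1, \\
\lceil \log_2 n \rceil = D_\Omega(f) \le D_\Omega\bigl(f_M^{(n)}\bigr) \le n \cdot 1 .
\end{gather*}
```

Under these assumptions the size bounds are within one gate of each other, while the depth bounds are far apart: the unrolled circuit is a chain of $\oplus$ gates, whereas a balanced tree of the same gates is much shallower.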

These inequalities stipulate constraints that must hold between the time $T$ and the circuit
size and depth of the machine $M$ if it is used to compute the function $f$ in $T$ steps. Let the
product $T C_\Omega(\delta, \lambda)$ be defined as the equivalent number of logic operations performed by
$M$. The first inequality of the above theorem can be interpreted as saying that the number of
equivalent logic operations performed by an FSM to compute a function $f$ must be at least
the minimum number of gates necessary to compute $f$ with a circuit. A similar interpretation
can be given to the second inequality involving circuit depth.

The first inequality of Theorem 3.1.1 and the interpretation given to $T \cdot C_\Omega(\delta, \lambda)$ justify
the  following  definitions  of  computational  work  and  power.  Here  power  is  interpreted  as 
the  time  rate  at  which  work  is  done.  These  measures  correlate  nicely  with  our  intuition  that 
machines  that  contain  more  equivalent  computing  elements  are  more  powerful. 

DEFINITION 3.1.2 The computational work done by an FSM $M = (\Sigma, \Psi, Q, \delta, \lambda, s, F)$ is
$T C_\Omega(\delta, \lambda)$, the number of equivalent logical operations performed by $M$, which is the product of
$T$, the number of steps executed by $M$, and $C_\Omega(\delta, \lambda)$, the size complexity of its next-state and output
functions. The power of an FSM $M$ is $C_\Omega(\delta, \lambda)$, the number of logical operations performed by
$M$ per step.

Theorem 3.1.1 is also a form of impossibility theorem: it is impossible to compute functions
$f$ for which $T C_\Omega(\delta, \lambda)$ and $T D_\Omega(\delta, \lambda)$ are respectively less than the size and depth
complexity of $f$. It may be possible to compute a function on some points of its domain
with  smaller  values  of  these  parameters,  but  not  on  all  points.  The  halting  problem,  another 
example  of  an  impossibility  theorem,  is  presented  in  Section  5.8.2.  However,  it  deals  with  the 
computation  of  functions  over  infinite  domains. 

The inequalities of Theorem 3.1.1 also place upper limits on the size and depth complexities
of functions that can be computed in a bounded number of steps by an FSM, regardless
of how the FSM performs the computation.

Note  that  there  is  no  guarantee  that  the  upper  bounds  stated  in  Theorem  3.1.1  are  at  all 
close  to  the  lower  bounds.  It  is  always  possible  to  compute  a  function  inefficiently,  that  is,  with 
resources  that  are  greater  than  the  minimal  resources  necessary. 

3.1.3  Circuits  Are  Universal  for  Bounded  FSM  Computations 

We now ask whether the classes of functions computed by circuits and by FSMs executing
a bounded number of steps are different. We show that they are the same. Many different
functions can be computed from the function $f_M^{(T)}$ by specializing inputs and/or deleting
outputs.

THEOREM 3.1.2 Every subfunction of the function $f_M^{(T)}$ computable by an FSM on $n$ inputs is
computable by a Boolean circuit and vice versa.

Proof A Boolean function on $n$ inputs, $f$, may be computed by an FSM with $2^{n+1} - 1$
states by branching from the current state to one of two different states on inputs 0 and 1
until all $n$ inputs have been read; it then produces the output that would be produced by $f$
on these $n$ inputs. A fifteen-state version of this machine that computes the EXCLUSIVE OR
on three inputs as a subfunction is shown in Fig. 3.4.

The proof in the other direction is also straightforward, as described above and represented
schematically in Fig. 3.3. Given a binary representation of the input, output, and state
symbols of an FSM, their associated next-state and output functions are binary functions.
They can be realized by circuits, as can $f_M^{(n)}(s, w) = (q^{(n)}, y)$, the function computed by
the FSM on $n$ inputs, as suggested by Fig. 3.3. Finally, the subfunction $f$ is obtained by
fixing the appropriate inputs, assigning variable names to the remaining inputs, and deleting
the appropriate outputs. ■
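
The tree-shaped machine used in the first half of the proof can be sketched as follows. This is a minimal illustration with names of our own choosing: each state is the tuple of input bits read so far, and the output letter attached to states that have not yet read all $n$ inputs is an arbitrary placeholder (0), since those outputs are deleted when forming the subfunction.

```python
from itertools import product


def tree_fsm(f, n):
    """Build the branching FSM for an n-input Boolean function f.

    States are the input prefixes of length 0..n, so there are 2^(n+1) - 1 of them.
    """
    states = [p for k in range(n + 1) for p in product((0, 1), repeat=k)]
    # Branch to one of two new states on inputs 0 and 1 until n inputs are read.
    delta = {(p, b): p + (b,) for p in states if len(p) < n for b in (0, 1)}
    # Only states that have read all n inputs carry the value of f.
    lam = {p: (f(*p) if len(p) == n else 0) for p in states}
    return states, delta, lam


def compute(f, n, bits):
    """Run the tree FSM on n input bits and return the last output letter."""
    _, delta, lam = tree_fsm(f, n)
    state = ()
    for b in bits:
        state = delta[(state, b)]
    return lam[state]
```

For $n = 3$ the construction yields fifteen states, matching the count stated in the proof.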


©John  E  Savage 




Figure 3.4 A fifteen-state FSM that computes the EXCLUSIVE OR of three inputs as a subfunction
of $f_M^{(3)}$ obtained by deleting all outputs except the third.


3.1.4  Interconnections  of  Finite-State  Machines 

Later  in  this  chapter  we  examine  a  family  of  FSMs  characterized  by  a  computational  unit 
connected  to  storage  devices  of  increasing  size.  The  random-access  machine  that  has  a  CPU 
of  small  complexity  and  a  random-access  memory  of  large  but  indeterminate  size  is  of  this 
type.  The  Turing  machine  having  a  fixed  control  unit  that  moves  a  tape  head  over  a  potentially 
infinite  tape  is  another  example. 

This  idea  is  captured  by  the  interconnection  of  synchronous  FSMs.  Synchronous  FSMs 
read  inputs,  advance  from  state  to  state,  and  produce  outputs  in  synchronism.  We  allow  two 
or  more  synchronous  FSMs  to  be  interconnected  so  that  some  outputs  from  one  FSM  are 
supplied  as  inputs  of  another,  as  illustrated  in  Fig.  3.5.  Below  we  generalize  Theorem  3.1.1  to 
a  pair  of  synchronous  FSMs.  We  model  random-access  machines  and  Turing  machines  in  this 
fashion  when  each  uses  a  finite  amount  of  storage. 

THEOREM 3.1.3 Let $f_{M_1 \times M_2}^{(T)}$ be a function computed in $T$ steps by a pair of interconnected synchronous
FSMs, $M_1 = (\Sigma_1, \Psi_1, Q_1, \delta_1, \lambda_1, s_1, F_1)$ and $M_2 = (\Sigma_2, \Psi_2, Q_2, \delta_2, \lambda_2, s_2, F_2)$.




Figure 3.5 The interconnection of two finite-state machines in which one of the three outputs
of $M_1$ is supplied as an input to $M_2$ and two of the three outputs of $M_2$ are supplied to $M_1$ as
inputs.

Figure 3.6 A circuit simulating $T$ steps of the two synchronous interconnected FSMs shown
in Fig. 3.5. The top row of circuits simulates a $T$-step computation by $M_1$ and the bottom row
simulates a $T$-step computation by $M_2$. One of the three outputs of $M_1$ is supplied as an input
to $M_2$ and two of the three outputs of $M_2$ are supplied to $M_1$ as inputs. The states of $M_1$ on the
initial and $T$ successive steps are $q_0, q_1, \ldots, q_T$. Those of $M_2$ are $p_0, p_1, \ldots, p_T$.


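Let $C_\Omega(\delta, \lambda)$ and $D_\Omega(\delta, \lambda)$ be the size and depth of encodings of the next-state and output functions.
Then, the circuit size and depth over the basis $\Omega$ of any function $f$ computed by the pair
$M_1 \times M_2$ in $T$ steps (that is, a subfunction of $f_{M_1 \times M_2}^{(T)}$) satisfy the following inequalities:

$$C_\Omega(f) \le T \left[ C_\Omega(\delta_1, \lambda_1) + C_\Omega(\delta_2, \lambda_2) \right]$$

$$D_\Omega(f) \le T \left[ \max\left(D_\Omega(\delta_1, \lambda_1), D_\Omega(\delta_2, \lambda_2)\right) \right]$$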

Proof The construction that leads to this result is suggested by Fig. 3.6. We unwind both
FSMs and connect the appropriate outputs from one to the other to produce a circuit that
computes $f_{M_1 \times M_2}^{(T)}$. Observe that the number of gates in the simulating circuit is $T$ times the
sum of the number of gates in the circuits for the two machines, whereas the depth is $T$ times
the depth of the deeper circuit. ■
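
A lockstep simulation of two interconnected machines, together with the two bounds of the theorem, can be sketched as below. The wiring is an assumption made for illustration and is simpler than that of Fig. 3.5: $M_1$ reads only the external input, and $M_2$ reads the output letter of the state $M_1$ held before the step. All function and table names are hypothetical.

```python
def run_pair(delta1, lam1, delta2, lam2, q, p, inputs):
    """Run M1 and M2 synchronously; return both final states and the output pairs.

    Assumed wiring: M1 reads the external input w; M2 reads lam1 of the
    state M1 held before the step. There is no feedback from M2 to M1.
    """
    outputs = []
    for w in inputs:
        q_next = delta1[(q, w)]
        p_next = delta2[(p, lam1[q])]
        q, p = q_next, p_next          # both machines advance together
        outputs.append((lam1[q], lam2[p]))
    return q, p, outputs


def size_bound(T, c1, c2):
    """Upper bound T * (C(delta1, lambda1) + C(delta2, lambda2)) on circuit size."""
    return T * (c1 + c2)


def depth_bound(T, d1, d2):
    """Upper bound T * max(D(delta1, lambda1), D(delta2, lambda2)) on circuit depth."""
    return T * max(d1, d2)


# Both machines are copies of a two-state parity machine (hypothetical tables).
XOR_DELTA = {("q0", 0): "q0", ("q0", 1): "q1",
             ("q1", 0): "q1", ("q1", 1): "q0"}
XOR_LAMBDA = {"q0": 0, "q1": 1}
```

The bound functions simply evaluate the right-hand sides of the two inequalities for given per-step sizes and depths; they say nothing about how close the bounds are to the true complexity.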


3.1.5  Nondeterministic  Finite-State  Machines 

The finite-state machine model described above is called a deterministic FSM (DFSM) because,
given a current state and an input, the next state of the FSM is uniquely determined.
A  potentially  more  general  FSM  model  is  the  nondeterministic  FSM  (NFSM)  characterized 
by  the  possibility  that  several  next  states  can  be  reached  from  the  current  state  for  some  given 
input  letter. 

One might ask if such a model has any use, especially since to the untrained eye a nondeterministic
machine would appear to be a dysfunctional deterministic one. The value of an
NFSM is that it may recognize languages with fewer states and in less time than needed by a
DFSM. The concept of nondeterminism will be extended later to the Turing machine, where
it is used to classify languages in terms of the time and space they need for recognition. For
example, it will be used to identify the class NP of languages that are recognized by nondeterministic
Turing machines in a number of steps that is polynomial in the length of their inputs.
(See  Section  3.9.6.)  Many  important  combinatorial  problems,  such  as  the  traveling  salesperson 
problem,  fall  into  this  class. 

The formal definition of the NFSM is given in Section 4.1, where the next-state function
$\delta : Q \times \Sigma \mapsto Q$ of the FSM is replaced by a next-state function $\delta : Q \times \Sigma \mapsto 2^Q$. Such
functions assign to each state $q$ and input letter $a$ a subset $\delta(q, a)$ of the set $Q$ of states of the
NFSM. ($2^Q$, the power set, is the set of all subsets of $Q$. It is introduced in Section 1.2.1.)
Since the value of $\delta(q, a)$ can be the empty set, there may be no successor to the state $q$ on
input $a$. Also, since $\delta(q, a)$ when viewed as a set can contain more than one element, a state
$q$ can have edges labeled $a$ to several other states. Since a DFSM has a single successor to each
state on every input, a DFSM is an NFSM in which $\delta(q, a)$ is a singleton set.

While  a  DFSM  M  accepts  a  string  w  if  w  causes  M  to  move  from  the  initial  state  to  a 
final  state  in  F,  an  NFSM  accepts  w  if  there  is  some  set  of  next-state  choices  for  w  that  causes 
M  to  move  from  the  initial  state  to  a  final  state  in  F. 

An  NFSM  can  be  viewed  as  a  purely  deterministic  finite-state  machine  that  has  two  inputs, 
as  suggested  in  Fig.  3.7.  The  first,  the  standard  input,  a,  accepts  the  user’s  data.  The  second, 
the choice input, $c$, is used to choose a successor state when there is more than one. The information
provided via the choice input is not under the control of the user supplying data via
the  standard  input.  As  a  consequence,  the  machine  is  nondeterministic  from  the  point  of  view 
of  the  user  but  fully  deterministic  to  an  outside  observer.  It  is  assumed  that  the  choice  agent 
supplies  the  choice  input  and,  with  full  knowledge  of  the  input  to  be  provided  by  the  user, 
chooses  state  transitions  that,  if  possible,  lead  to  acceptance  of  the  user  input.  On  the  other 
hand,  the  choice  agent  cannot  force  the  machine  to  accept  inputs  for  which  it  is  not  designed. 

In  an  NFSM  it  is  not  required  that  a  state  q  have  a  successor  for  each  value  of  the  standard 
and choice inputs. This possibility is captured by allowing $\delta(q, a, c)$ to have no value, denoted
by $\delta(q, a, c) = \perp$.

Figure  3.8  shows  an  NFSM  that  recognizes  strings  over  B*  that  end  in  00101.  In  this 
figure  parentheses  surround  the  choice  input  when  its  value  is  needed  to  decide  the  next  state. 
In  this  machine  the  choice  input  is  set  to  1  when  the  choice  agent  knows  that  the  user  is  about 
to  supply  the  suffix  00101. 
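
One standard way to decide acceptance by such a machine, without a choice agent, is to track the set of all states reachable on the input read so far. The sketch below uses that technique. Its transition table is a reconstruction from the description of the machine (stay in the start state on any input, or guess on a 0 that the suffix 00101 has begun); the state names are hypothetical and the figure itself may differ in detail.

```python
# Nondeterministic next-state function: (state, letter) -> set of successor states.
DELTA = {
    ("q0", "0"): {"q0", "q1"},   # stay, or guess that the suffix starts here
    ("q0", "1"): {"q0"},
    ("q1", "0"): {"q2"},
    ("q2", "1"): {"q3"},
    ("q3", "0"): {"q4"},
    ("q4", "1"): {"q5"},
}
START, FINAL = "q0", {"q5"}


def accepts(w):
    """Accept w if some sequence of next-state choices ends in a final state."""
    current = {START}
    for a in w:
        # Missing entries mean the state has no successor on this letter.
        current = set().union(*(DELTA.get((q, a), set()) for q in current))
    return bool(current & FINAL)
```

Because the final state has no outgoing transitions in this reconstruction, a string is accepted only when the guessed suffix ends exactly at the end of the input.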


Figure 3.7 A nondeterministic finite-state machine modeled as a deterministic one that has a
second choice input whose value disambiguates the value of the next state.


Figure  3.8  A  nondeterministic  FSM  that  accepts  binary  strings  ending  in  00101.  Choice 
inputs  are  shown  in  parentheses  for  those  user  inputs  for  which  the  value  of  choice  inputs  can 
disambiguate  next-state  moves. 




Figure 3.9 An example of an NFSM whose choice agent (its values are in parentheses) accepts
not only strings in a language $L$, but all strings.


Although  we  use  the  anthropomorphic  phrase  “choice  agent,”  it  is  important  to  note  that 
this  choice  agent  cannot  freely  decide  which  strings  to  accept  and  which  not.  Instead,  it  must 
when  possible  make  choices  leading  to  acceptance.  Consider,  for  example,  the  machine  in 
Fig.  3.9.  It  would  appear  that  its  choice  agent  can  accept  strings  in  an  arbitrary  language  L.  In 
fact,  the  language  that  it  accepts  contains  all  strings. 

Given  a  string  w  in  the  language  L  accepted  by  an  NFSM,  a  choice  string  that  leads  to  its 
acceptance  is  said  to  be  a  succinct  certificate  for  its  membership  in  L. 
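
The role of a certificate can be illustrated with the machine for strings ending in 00101, viewed as a deterministic machine with a choice input. The sketch below is built on assumptions: the three-argument next-state function is our reconstruction of that machine, `None` plays the role of a missing successor, and `certificate` is a hypothetical helper that sets the choice input to 1 only at the start of the suffix.

```python
CHAIN = {("q1", "0"): "q2", ("q2", "1"): "q3",
         ("q3", "0"): "q4", ("q4", "1"): "q5"}


def delta(q, a, c):
    """Deterministic next state for standard input a and choice input c.

    Returns None when the state has no successor.
    """
    if q == "q0":
        if a == "0" and c == "1":
            return "q1"          # the agent signals that the suffix begins here
        return "q0"
    return CHAIN.get((q, a))     # the choice input is ignored off the start state


def verify(w, choices):
    """Check that the choice string leads w from the start state to the final state."""
    if len(choices) != len(w):
        return False
    q = "q0"
    for a, c in zip(w, choices):
        q = delta(q, a, c)
        if q is None:
            return False
    return q == "q5"


def certificate(w):
    """Return a choice string for w if w ends in 00101, otherwise None."""
    if not w.endswith("00101"):
        return None
    return "0" * (len(w) - 5) + "1" + "0" * 4
```

Checking a certificate takes one pass over the input, and no choice string makes `verify` succeed on a string outside the language, in line with the remark that the choice agent cannot force acceptance.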

It  is  important  to  note  that  the  nondeterministic  finite-state  machine  is  not  a  model  of 
reality,  but  is  used  instead  primarily  to  classify  languages.  In  Section  4.1  we  explore  the 
language-recognition  capability  of  the  deterministic  and  nondeterministic  finite-state  machines 
and  show  that  they  are  the  same.  However,  the  situation  is  not  so  clear  with  regard  to  Turing 
machines  that  have  access  to  unlimited  storage  capacity.  In  this  case,  we  do  not  know  whether 
or  not  the  set  of  languages  accepted  in  polynomial  time  on  deterministic  Turing  machines  (the 
class  P)  is  the  same  set  of  languages  that  is  accepted  in  polynomial  time  by  nondeterministic 
Turing machines (the class NP).

3.2  Simulating  FSMs  with  Shallow  Circuits* 

In Section 3.1 we demonstrated that every T-step FSM computation can be simulated by a circuit whose size and depth are both O(T). In this section we show that every T-step finite-state machine computation can be simulated by a circuit whose size and depth are O(T) and O(log T), respectively. While this seems a serious improvement in the depth bound, the coefficients hidden in the big-O notation for both bounds depend on the number of states of the FSM and can be very large. Nevertheless, for simple problems, such as binary addition, the results of this section can be useful. We illustrate this here for binary addition by exhibiting small and shallow circuits for the adder FSM of Fig. 3.10. The circuit simulation for this FSM produces the carry-lookahead adder circuit of Section 2.7. In this section we use matrix multiplication, which is covered in Chapter 6.

Figure 3.10 A finite-state machine that adds two binary numbers. Their two least significant bits are supplied first followed by those of increasing significance. The output bits represent the sum of the two numbers.

The new method is based on the representation of the function $f_M^{(T)} : Q \times \Sigma^T \mapsto Q \times \Psi^T$ computed in T steps by an FSM $M = (\Sigma, \Psi, Q, \delta, \lambda, s, F)$ in terms of the set of state-to-state mappings $S = \{h : Q \mapsto Q\}$, where S contains the mappings $\{\Delta_x : Q \mapsto Q \mid x \in \Sigma\}$ and $\Delta_x$ is defined below.

$$\Delta_x(q) = \delta(q, x) \tag{3.1}$$

That is, $\Delta_x(q)$ is the state to which state q is carried by the input letter x.

The FSM shown in Fig. 3.10 adds two binary numbers sequentially by simulating a ripple adder. (See Section 2.7.) Its input alphabet is $\mathcal{B}^2$, that is, the set of pairs of 0's and 1's. Its output alphabet is $\mathcal{B}$ and its state set is $Q = \{q_0, q_1, q_2, q_3\}$. (A sequential circuit for this machine is designed in Section 3.3.) It has the state-to-state mappings shown in Fig. 3.11.

Let $\odot : S^2 \mapsto S$ be the operator defined on the set S of state-to-state mappings where for arbitrary $h_1, h_2 \in S$ and state $q \in Q$ the operator $\odot$ is defined as follows:

$$(h_1 \odot h_2)(q) = h_2(h_1(q)) \tag{3.2}$$


q      $\Delta_{0,0}(q)$   $\Delta_{0,1}(q)$   $\Delta_{1,0}(q)$   $\Delta_{1,1}(q)$
$q_0$    $q_0$           $q_1$           $q_1$           $q_2$
$q_1$    $q_0$           $q_1$           $q_1$           $q_2$
$q_2$    $q_1$           $q_2$           $q_2$           $q_3$
$q_3$    $q_1$           $q_2$           $q_2$           $q_3$

Figure 3.11 The state-to-state mappings associated with the FSM of Fig. 3.10.

The state-to-state mappings in S will be obtained by composing the mappings $\{\Delta_x : Q \mapsto Q \mid x \in \Sigma\}$ using this operator.

Below we show that the operator $\odot$ is associative, that is, $\odot$ satisfies the property $(h_1 \odot h_2) \odot h_3 = h_1 \odot (h_2 \odot h_3)$. This means that for each $q \in Q$, $((h_1 \odot h_2) \odot h_3)(q) = (h_1 \odot (h_2 \odot h_3))(q) = h_3(h_2(h_1(q)))$. Applying the definition of $\odot$ in Equation (3.2), we have the following for each $q \in Q$:

$$\begin{aligned}
((h_1 \odot h_2) \odot h_3)(q) &= h_3((h_1 \odot h_2)(q)) \\
&= h_3(h_2(h_1(q))) \\
&= (h_2 \odot h_3)(h_1(q)) \\
&= (h_1 \odot (h_2 \odot h_3))(q)
\end{aligned} \tag{3.3}$$

Thus, $\odot$ is associative and $(S, \odot)$ is a semigroup. (See Section 2.6.) It follows that a prefix computation can be done on a sequence of state-to-state mappings.
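The composition operator and its associativity can be checked mechanically. The sketch below is illustrative only: it represents a state-to-state mapping as a tuple `h` in which `h[i]` is the index of the state to which $q_i$ is carried, uses the adder mappings of Fig. 3.11, and verifies associativity exhaustively on a three-element state set. The names `DELTA`, `compose`, `all_mappings`, and `is_associative` are hypothetical.

```python
from itertools import product

# State-to-state mappings of the adder FSM: DELTA[x][i] is the index of
# the state reached from q_i on input pair x.
DELTA = {
    (0, 0): (0, 0, 1, 1),
    (0, 1): (1, 1, 2, 2),
    (1, 0): (1, 1, 2, 2),
    (1, 1): (2, 2, 3, 3),
}


def compose(h1, h2):
    """Composition as in Equation (3.2): apply h1 first, then h2."""
    return tuple(h2[h1[q]] for q in range(len(h1)))


def all_mappings(n):
    """All n**n mappings from an n-element state set to itself."""
    return list(product(range(n), repeat=n))


def is_associative(mappings):
    """Exhaustively test (a.b).c == a.(b.c) over the given mappings."""
    return all(
        compose(compose(a, b), c) == compose(a, compose(b, c))
        for a in mappings
        for b in mappings
        for c in mappings
    )
```

For example, `compose(DELTA[(1, 1)], DELTA[(0, 0)])` is the mapping for the input pair (1, 1) followed by (0, 0); it carries every state to $q_1$.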

We now use this observation to construct a shallow circuit for the function $f_M^{(T)}$. Let $w = (w_1, w_2, \ldots, w_T)$ be a sequence of T inputs to M where $w_j$ is supplied on the jth step. Let $q^{(j)}$ be the state of M after receiving the jth input. From the definition of $\odot$ it follows that $q^{(j)}$ has the following value where s is the initial state of M:

$$q^{(j)} = (\Delta_{w_1} \odot \Delta_{w_2} \odot \cdots \odot \Delta_{w_j})(s)$$

The value of $f_M^{(T)}$ on initial state s and T inputs can be represented in terms of $q = (q^{(1)}, \ldots, q^{(T)})$ as follows:

$$f_M^{(T)}(s, w) = \left(q^{(T)}, \lambda(q^{(1)}), \lambda(q^{(2)}), \ldots, \lambda(q^{(T)})\right)$$

Let $\Lambda^{(T)}$ be the following sequence of state-to-state mappings:

$$\Lambda^{(T)} = (\Delta_{w_1}, \Delta_{w_2}, \ldots, \Delta_{w_T})$$

It follows that q can be obtained by computing the state-to-state mappings $\Delta_{w_1} \odot \Delta_{w_2} \odot \cdots \odot \Delta_{w_j}$, $1 \le j \le T$, and applying them to the initial state s. Because $\odot$ is associative, these T state-to-state mappings are produced by the prefix operator $\mathcal{P}_\odot^{(T)}$ on the sequence $\Lambda^{(T)}$ (see Theorem 2.6.1):

$$\mathcal{P}_\odot^{(T)}(\Lambda^{(T)}) = \left(\Delta_{w_1}, (\Delta_{w_1} \odot \Delta_{w_2}), \ldots, (\Delta_{w_1} \odot \Delta_{w_2} \odot \cdots \odot \Delta_{w_T})\right)$$

Restating Theorem 2.6.1 for this problem, we have the following result.

THEOREM 3.2.1 For $T = 2^k$, k an integer, the T state-to-state mappings defined by the T inputs to an FSM M can be computed by a circuit over the basis $\Omega = \{\odot\}$ whose size and depth satisfy the following bounds:

$$C_\Omega\left(\mathcal{P}_\odot^{(T)}\right) \le 2T - \log_2 T - 2$$
$$D_\Omega\left(\mathcal{P}_\odot^{(T)}\right) \le 2 \log_2 T$$
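A prefix computation over state-to-state mappings can be sketched with a recursive pairing scheme. This is one standard parallel prefix construction, used here as an assumption; the text does not say that it is the circuit behind Theorem 2.6.1. The counter records how many times the composition operator is applied so the count can be compared with the size bound. The names `prefix`, `state_sequence`, and `DELTA` are hypothetical.

```python
from itertools import accumulate

# Adder mappings of Fig. 3.11 as tuples: DELTA[x][i] is the successor of q_i.
DELTA = {
    (0, 0): (0, 0, 1, 1),
    (0, 1): (1, 1, 2, 2),
    (1, 0): (1, 1, 2, 2),
    (1, 1): (2, 2, 3, 3),
}


def compose(h1, h2):
    """Apply h1 first, then h2, as in Equation (3.2)."""
    return tuple(h2[h1[q]] for q in range(len(h1)))


def prefix(xs, op, counter):
    """Prefix computation on xs, whose length is assumed to be a power of two.

    counter[0] is incremented once per application of op.
    """
    n = len(xs)
    if n == 1:
        return list(xs)
    # Combine adjacent pairs; these n/2 operations are independent.
    pairs = []
    for i in range(0, n, 2):
        pairs.append(op(xs[i], xs[i + 1]))
        counter[0] += 1
    z = prefix(pairs, op, counter)  # z[i] covers the first 2(i+1) inputs
    out = [xs[0]]
    for i in range(n // 2):
        out.append(z[i])
        if 2 * i + 2 < n:
            out.append(op(z[i], xs[2 * i + 2]))  # extend by one more input
            counter[0] += 1
    return out


def state_sequence(w, s=0):
    """States q^(1), ..., q^(T) reached from q_s on the input pairs w."""
    mappings = [DELTA[x] for x in w]
    return [h[s] for h in prefix(mappings, compose, [0])]
```

The result can be compared with a sequential scan such as `list(accumulate(mappings, compose))`; for T = 8 this scheme applies the operator 11 times, which equals $2T - \log_2 T - 2$.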

The construction of a shallow Boolean circuit for $f_M^{(T)}$ is reduced to a five-step problem: 1) for each input letter x design a circuit whose input and output are representations of states and which defines the state-to-state mapping $\Delta_x$ for input letter x; 2) construct a circuit for the associative operator $\odot$ that accepts the representations of two state-to-state mappings $\Delta_y$ and $\Delta_z$ and produces a representation for the state-to-state mapping $\Delta_y \odot \Delta_z$; 3) use the circuit for $\odot$ in a parallel prefix circuit to produce the T state-to-state mappings; 4) construct a circuit that combines the representation of the initial state s with that of the state-to-state mapping $\Delta_{w_1} \odot \Delta_{w_2} \odot \cdots \odot \Delta_{w_j}$ to obtain a representation for the successor state $(\Delta_{w_1} \odot \Delta_{w_2} \odot \cdots \odot \Delta_{w_j})(s)$; and 5) construct a circuit for $\lambda$ that computes an output from the representation of a state.

We  now  describe  a  generic,  though  not  necessarily  efficient,  implementation  of  these  steps. 

Let $Q = \{q_0, q_1, \ldots, q_{|Q|-1}\}$ be the states of M. The state-to-state mapping $\Delta_x$ for the FSM M needed for the first step can be represented by a $|Q| \times |Q|$ Boolean matrix $N(x) = \{n_{i,j}(x)\}$ in which the entry in row i and column j, $n_{i,j}(x)$, satisfies

$$n_{i,j}(x) = \begin{cases} 1 & \text{if } M \text{ moves from state } q_i \text{ to state } q_j \text{ on input } x \\ 0 & \text{otherwise} \end{cases}$$


Consider again the FSM shown in Fig. 3.10. The matrices associated with its four pairs of inputs $x \in \{(0,0), (0,1), (1,0), (1,1)\}$ are shown below, where $N((0,1)) = N((1,0))$:

$$N((0,0)) = \begin{bmatrix} 1&0&0&0 \\ 1&0&0&0 \\ 0&1&0&0 \\ 0&1&0&0 \end{bmatrix} \qquad N((0,1)) = \begin{bmatrix} 0&1&0&0 \\ 0&1&0&0 \\ 0&0&1&0 \\ 0&0&1&0 \end{bmatrix} \qquad N((1,1)) = \begin{bmatrix} 0&0&1&0 \\ 0&0&1&0 \\ 0&0&0&1 \\ 0&0&0&1 \end{bmatrix}$$

From these matrices the generic matrix $N((u,v))$ parameterized by the values of the inputs (a pair $(u,v)$ in this example) is produced from the following Boolean functions: $t = \overline{u} \wedge \overline{v}$, the carry-terminate function, $p = u \oplus v$, the carry-propagate function, and $g = u \wedge v$, the carry-generate function.

$$N((u,v)) = \begin{bmatrix} t&p&g&0 \\ t&p&g&0 \\ 0&t&p&g \\ 0&t&p&g \end{bmatrix}$$
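As a quick check, the generic matrix can be compared with the transition matrix obtained directly from the behavior of the adder. The sketch assumes that state $q_r$ encodes the integer $r = 2c + s$ for carry c and sum bit s, as in the caption of Fig. 3.13; the function names are hypothetical.

```python
def generic_matrix(u, v):
    """N((u, v)) built from the carry-terminate, -propagate, -generate bits."""
    t = (1 - u) & (1 - v)  # carry-terminate: both inputs are 0
    p = u ^ v              # carry-propagate: exactly one input is 1
    g = u & v              # carry-generate: both inputs are 1
    return [
        [t, p, g, 0],
        [t, p, g, 0],
        [0, t, p, g],
        [0, t, p, g],
    ]


def transition_matrix(u, v):
    """N((u, v)) built from the adder itself, with q_r encoding r = 2c + s."""
    n = [[0] * 4 for _ in range(4)]
    for r in range(4):
        c = r >> 1                 # carry stored in the current state
        total = u + v + c          # full-adder sum of the three bits
        nxt = 2 * (total >> 1) + (total & 1)
        n[r][nxt] = 1
    return n
```

The two constructions agree on all four input pairs, and each row contains exactly one 1 because exactly one of t, p, and g is 1.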


Let $\sigma(i) = (0, 0, \ldots, 0, 1, 0, \ldots, 0)$ be the unit $|Q|$-vector that has value 1 in the ith position and zeros elsewhere. Let $\sigma(i)N(x)$ denote Boolean vector-matrix multiplication in which addition is OR and multiplication is AND. Then, for each i, $\sigma(i)N(x) = (n_{i,1}, n_{i,2}, \ldots, n_{i,|Q|})$ is the unit vector denoting the state that M enters when it is in state $q_i$ and receives input x.

Let $N(x, y) = N(x) \times N(y)$ be the Boolean matrix-matrix multiplication of matrices N(x) and N(y) in which addition is OR and multiplication is AND. Then, for each x and y the entry in row i and column j of $N(x) \times N(y)$, namely $n_{i,j}^{(2)}(x, y)$, satisfies the following identity:

$$n_{i,j}^{(2)}(x, y) = \bigvee_{q_t \in Q} n_{i,t}(x) \cdot n_{t,j}(y)$$

That is, $n_{i,j}^{(2)}(x, y) = 1$ if there is a state $q_t \in Q$ such that in state $q_i$ M is given input x, moves to state $q_t$, and then moves to state $q_j$ on input y. Thus, the composition operator $\odot$ can be realized through the multiplication of Boolean matrices. It is straightforward to show that matrix multiplication is associative. (See Problem 3.10.)

Since matrix multiplication is associative, a prefix computation using matrix multiplication as a composition operator for each prefix $x^{(j)} = (x_1, x_2, \ldots, x_j)$ of the input string x generates a matrix $N(x^{(j)}) = N(x_1) \times N(x_2) \times \cdots \times N(x_j)$ defining the state-to-state mapping associated with $x^{(j)}$ for each value of $1 \le j \le n$.

The fourth step, the application of a sequence of state-to-state mappings to the initial state $s = q_r$, represented by the $|Q|$-vector $\sigma(r)$, is obtained through the vector-matrix multiplication $\sigma(r)N(x^{(j)})$ for $1 \le j \le n$.

The fifth step involves the computation of the output word from the current state. Let the column $|Q|$-vector $\lambda$ contain in the tth position the output of the FSM M when in state $q_t$. Then, the output produced by the FSM after the jth input is the product $\sigma(r)N(x^{(j)})\lambda$. This result is summarized below.

THEOREM 3.2.2 Let the finite-state machine $M = (\Sigma, \Psi, Q, \delta, \lambda, s, F)$ with $|Q|$ states compute a subfunction f of $f_M^{(T)}$ in T steps. Then f has the following size and depth bounds over the standard basis $\Omega_0$ for some $\kappa \ge 1$:

$$C_{\Omega_0}(f) = O\left(M_{\mathrm{matrix}}(|Q|, \kappa)\, T\right)$$
$$D_{\Omega_0}(f) = O\left((\kappa \log |Q|)(\log T)\right)$$

Here $M_{\mathrm{matrix}}(n, \kappa)$ is the size of a circuit to multiply two $n \times n$ matrices with a circuit of depth $\kappa \log n$. These bounds can be achieved simultaneously.

Proof The circuits realizing the Boolean functions $\{n_{i,j}(x) \mid 1 \le i, j \le |Q|\}$, x an input, each have a size determined by the size of the input alphabet $\Sigma$, which is constant. The number of operations required to multiply two Boolean matrices with a circuit of depth $\kappa \log |Q|$, $\kappa \ge 1$, is $M_{\mathrm{matrix}}(|Q|, \kappa)$. (See Section 6.3. Note that $M_{\mathrm{matrix}}(|Q|, \kappa) \le |Q|^3$.) Finally, the prefix circuit uses O(T) copies of the matrix multiplication circuit and has a depth of O(log T) copies of the matrix multiplication circuit along the longest path. (See Section 2.6.) ■
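The matrix formulation of the five steps can be traced in software for the adder FSM. The sketch below forms the prefix products sequentially for brevity, whereas a circuit would obtain them from a parallel prefix network. It assumes that state $q_r$ encodes $r = 2c + s$ and that the output in each state is the sum bit; the names `bool_matmul`, `adder_matrix`, `LAMBDA`, and `outputs` are hypothetical.

```python
def bool_matmul(a, b):
    """Boolean matrix product: addition is OR, multiplication is AND."""
    n = len(a)
    return [
        [int(any(a[i][t] and b[t][j] for t in range(n))) for j in range(n)]
        for i in range(n)
    ]


def adder_matrix(u, v):
    """N((u, v)) for the adder FSM, with q_r encoding r = 2c + s."""
    n = [[0] * 4 for _ in range(4)]
    for r in range(4):
        total = u + v + (r >> 1)
        n[r][2 * (total >> 1) + (total & 1)] = 1
    return n


LAMBDA = [0, 1, 0, 1]  # output (the sum bit) in states q0, q1, q2, q3


def outputs(pairs, r=0):
    """Output after each input: sigma(r) times N(x^(j)) times lambda."""
    sigma = [1 if i == r else 0 for i in range(4)]
    prod = None
    out = []
    for u, v in pairs:
        n = adder_matrix(u, v)
        prod = n if prod is None else bool_matmul(prod, n)
        # Row vector sigma(r) N(x^(j)): the unit vector of the current state.
        row = [int(any(sigma[i] and prod[i][j] for i in range(4)))
               for j in range(4)]
        out.append(int(any(row[j] and LAMBDA[j] for j in range(4))))
    return out
```

Feeding the pairs for 3 + 5 least significant bit first, followed by a final (0, 0) pair to flush the carry, yields the bits of 8 in the same order.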

When an FSM has a large number of states but its next-state function is relatively simple, that is, it has a size that is at worst a polynomial in $\log |Q|$, the above size bound will be much larger than the size bound given in Theorem 3.1.1 because $M_{\mathrm{matrix}}(n, \kappa)$ with $n = |Q|$ grows exponentially in $\log |Q|$. The depth bound grows linearly with $\log |Q|$ whereas the depth of the next-state function on which the depth bound of Theorem 3.1.1 depends will typically grow either linearly or as a small polynomial in $\log \log |Q|$ for an FSM with a relatively simple next-state function. Thus, the depth bound will be smaller than that of Theorem 3.1.1 for very large values of T, but for smaller values, the latter bound will dominate.

3.2.1 A Shallow Circuit Simulating Addition

Applying the above result to the adder FSM of Fig. 3.10, we produce a circuit that accepts T pairs of binary inputs and computes the sum as T-bit binary numbers. Since this FSM has four states, the theorem states that the circuit has size O(T) and depth O(log T). The carry-lookahead adder of Section 2.7 has these characteristics.

We can actually produce the carry-lookahead circuit by a more careful design of the state-to-state mappings. We use the following encodings for states, where states are represented by pairs {(c, s)}.

State Encoding

q       c   s
$q_0$     0   0
$q_1$     0   1
$q_2$     1   0
$q_3$     1   1

Since the next-state mappings are the same for inputs (0, 1) and (1, 0), we encode an input pair (u, v) by (p, g), where $g = u \wedge v$ and $p = u \oplus v$ are the carry-generate and carry-propagate variables introduced in Section 2.7 and used above. With these encodings, the three different next-state mappings $\{\Delta_{0,0}, \Delta_{0,1}, \Delta_{1,1}\}$ defined in Fig. 3.11 can be encoded as shown in the table below. The entry at the intersection of row (c, s) and column (p, g) in this table is the value $(c^*, s^*)$ of the generic next-state function $(c^*, s^*) = \Delta_{p,g}(c, s)$. (Here we abuse notation slightly to let $\Delta_{p,g}$ denote the state-to-state mapping associated with the pair (u, v) and represent the state q of M by the pair (c, s).)

            (p, g)
(c, s)    (0, 0)   (0, 1)   (1, 0)
(0, 0)    (0, 0)   (1, 0)   (0, 1)
(0, 1)    (0, 0)   (1, 0)   (0, 1)
(1, 0)    (0, 1)   (1, 1)   (1, 0)
(1, 1)    (0, 1)   (1, 1)   (1, 0)

Inspection of this table shows that we can write the following formulas for $c^*$ and $s^*$:

$$c^* = (p \wedge c) \vee g, \qquad s^* = p \oplus c$$

Consider two successive input pairs $(u_1, v_1)$ and $(u_2, v_2)$ and associated pairs $(p_1, g_1)$ and $(p_2, g_2)$. If the FSM of Fig. 3.10 is in state $(c_0, s_0)$ and receives input $(u_1, v_1)$, it enters the state $(c_1, s_1) = ((p_1 \wedge c_0) \vee g_1,\; p_1 \oplus c_0)$. This new state can be obtained by combining $p_1$ and $g_1$ with $c_0$. Let $(c_2, s_2)$ be the successor state when the mapping $\Delta_{p_2,g_2}$ is applied to $(c_1, s_1)$. The effect of the operator $\odot$ on successive state-to-state mappings $\Delta_{p_1,g_1}$ and $\Delta_{p_2,g_2}$ is shown below, in which (3.2) is used:

$$\begin{aligned}
(\Delta_{p_1,g_1} \odot \Delta_{p_2,g_2})(q) &= \Delta_{p_2,g_2}(\Delta_{p_1,g_1}((c_0, s_0))) \\
&= \Delta_{p_2,g_2}((p_1 \wedge c_0) \vee g_1,\; p_1 \oplus c_0) \\
&= ((p_2 \wedge ((p_1 \wedge c_0) \vee g_1)) \vee g_2,\; p_2 \oplus ((p_1 \wedge c_0) \vee g_1)) \\
&= (((p_2 \wedge p_1) \wedge c_0) \vee (g_2 \vee (p_2 \wedge g_1)),\; p_2 \oplus ((p_1 \wedge c_0) \vee g_1)) \\
&= (c_2, s_2)
\end{aligned}$$

It follows that $c_2$ can be computed from $p^* = p_2 \wedge p_1$ and $g^* = g_2 \vee (p_2 \wedge g_1)$ and $c_0$. The value of $s_2$ is obtained from $p_2$ and $c_1$. Thus the mapping $\Delta_{p_1,g_1} \odot \Delta_{p_2,g_2}$ is defined by $p^*$ and $g^*$, quantities obtained by combining the pairs $(p_1, g_1)$ and $(p_2, g_2)$ using the same associative operator $\diamond$ defined for the carry-lookahead adder in Section 2.7.1.

To summarize, the state-to-state mappings corresponding to subsequences of an input string $((u_0, v_0), (u_1, v_1), \ldots, (u_{n-2}, v_{n-2}), (u_{n-1}, v_{n-1}))$ can be computed by representing this string by the carry-propagate, carry-generate string $((p_0, g_0), (p_1, g_1), \ldots, (p_{n-2}, g_{n-2}), (p_{n-1}, g_{n-1}))$, computing the prefix operation on this string using the operator $\diamond$, then computing $c_i$ from $c_0$ and the carry-propagate and carry-generate functions for the ith stage and $s_i$ from this carry-propagate function and $c_{i-1}$. This leads to the carry-lookahead adder circuit of Section 2.7.1.
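The summary above translates into a short program. The sketch combines (p, g) pairs with the operator derived in this section, $(p^*, g^*) = (p_2 \wedge p_1,\; g_2 \vee (p_2 \wedge g_1))$, and then recovers carries and sum bits. The prefix is taken sequentially with `itertools.accumulate` for brevity; a carry-lookahead circuit would use a parallel prefix network instead. The argument order of `diamond` and the helper names are assumptions made for illustration.

```python
from itertools import accumulate


def diamond(first, second):
    """Combine the pair for an earlier stage with the pair for a later one."""
    p1, g1 = first
    p2, g2 = second
    return (p2 & p1, g2 | (p2 & g1))


def to_bits(x, n):
    """The n low-order bits of x, least significant bit first."""
    return [(x >> i) & 1 for i in range(n)]


def add_bits(a_bits, b_bits, c0=0):
    """Add two equal-length bit lists (least significant bit first).

    Returns (sum_bits, carry_out).
    """
    pg = [(u ^ v, u & v) for u, v in zip(a_bits, b_bits)]
    prefixes = list(accumulate(pg, diamond))
    # carries[i] is the carry into stage i; the last entry is the carry out.
    carries = [c0] + [(p & c0) | g for p, g in prefixes]
    sums = [p ^ carries[i] for i, (p, _) in enumerate(pg)]
    return sums, carries[-1]
```

For instance, `add_bits(to_bits(3, 4), to_bits(5, 4))` returns the bits of 8, least significant first, with carry out 0.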

3.3  Designing  Sequential  Circuits 

Sequential  circuits  are  concrete  machines  constructed  of  gates  and  binary  memory  devices. 
Given  an  FSM,  a  sequential  machine  can  be  constructed  for  it,  as  we  show. 

A  sequential  circuit  is  constructed  from  a  logic  circuit  and  a  collection  of  clocked  binary 
memory  units,  as  suggested  in  Figs.  3.12(a)  and  3.15.  (Shown  in  Fig.  3.12(a)  is  a  simple 
sequential  circuit  that  computes  the  EXCLUSIVE  OR  of  the  initial  value  in  memory  and  the 
external  input  to  the  sequential  circuit.)  Inputs  to  the  logic  circuit  consist  of  outputs  from  the 
binary  memory  units  as  well  as  external  inputs.  The  outputs  of  the  logic  circuit  serve  as  inputs 
to  the  clocked  binary  memory  units  as  well  as  external  outputs. 

A clocked binary memory unit is driven by a clock, a periodic signal that has value 1 (it is high) during short, uniformly spaced time intervals and is otherwise 0 (it is low), as suggested in Fig. 3.12(b). For correct operation it is assumed that the input to a memory unit does not change when the clock is high. Thus, the outputs of a logic circuit feeding the memory units cannot change during these intervals. This in turn requires that all changes in the inputs to this circuit be fully propagated to its outputs in the intervals when the clock is low. A circuit that operates this way is considered safe. Designers of sequential circuits calculate the time for signals to pass through a logic circuit and set the interval between clock pulses to ensure that the operation of the sequential circuit is safe.

Figure 3.12 (a) A sequential circuit with one gate and one clocked memory unit computing the EXCLUSIVE OR of its inputs; (b) a periodic clock pattern.

Sequential circuits are designed from finite-state machines (FSMs) in a series of steps. Consider an FSM $M = (\Sigma, \Psi, Q, \delta, \lambda, s)$ with input alphabet $\Sigma$, output alphabet $\Psi$, state set Q, next-state function $\delta : Q \times \Sigma \mapsto Q$, output function $\lambda : Q \mapsto \Psi$, and initial state s. (For this discussion we ignore the set of final states; they are important only when discussing language recognition.) We illustrate the design of a sequential machine using the FSM of Fig. 3.10, which is repeated in Fig. 3.13.

The first step in producing a sequential circuit from an FSM is to assign unique binary tuples to each input letter, output letter, and state (the state-assignment problem). This is illustrated for our FSM by the tables of Fig. 3.14 in which the identity encoding is used on inputs and outputs. This step can have a large impact on the size of the logic circuit produced. Second, tables for $\delta : \mathcal{B}^4 \mapsto \mathcal{B}^2$ and $\lambda : \mathcal{B}^2 \mapsto \mathcal{B}$, the next-state and output functions of the FSM, respectively, are produced from the description of the FSM, as shown in the same figure. Here $c^*$ and $s^*$ represent the successor to the state (c, s). Third, circuits are designed that realize the binary functions associated with $c^*$ and $s^*$. Fourth and finally, these circuits are connected to clocked binary memory devices, as shown in Fig. 3.15, to produce a sequential circuit that realizes the FSM. We leave to the reader the task of demonstrating that these circuits compute the functions defined by the tables. (See Problem 3.11.)
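The four design steps can be mimicked in software as a sanity check. The sketch below uses one possible pair of Boolean expressions for $c^*$ and $s^*$ (majority and parity of u, v, and c); the gates actually drawn in Fig. 3.15 may differ. The register is modeled by the variables `c` and `s`, which are updated once per clock period. All function names are hypothetical.

```python
def next_state(c, s, u, v):
    """Next-state logic: (c*, s*) from the state (c, s) and input (u, v).

    The stored sum bit s does not influence the successor state.
    """
    c_next = (u & v) | (c & (u ^ v))  # carry out of a full adder
    s_next = u ^ v ^ c                # sum bit of a full adder
    return c_next, s_next


def output(c, s):
    """Output logic: the FSM emits the sum bit of its current state."""
    return s


def run(pairs):
    """Clock the circuit once per input pair, starting in state q0 = (0, 0)."""
    c, s = 0, 0
    outs = []
    for u, v in pairs:
        c, s = next_state(c, s, u, v)  # the register loads (c*, s*)
        outs.append(output(c, s))
    return outs
```

Running the circuit on the pairs for 3 + 5, least significant bits first and padded with a final (0, 0), produces the bits of 8 in the same order.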

Since  gates  and  clocked  memory  devices  can  be  constructed  from  semiconductor  materials, 
a  sequential  circuit  can  be  assembled  from  physical  components  by  someone  skilled  in  the  use 
of  this  technology.  We  design  sequential  circuits  in  this  book  to  obtain  upper  bounds  on  the 
size  and  depth  of  the  next-state  and  output  functions  of  a  sequential  machine  so  that  we  can 
derive  computational  inequalities. 


Figure 3.13 A finite-state machine that simulates the ripple adder of Fig. 2.14. It is in state $q_r$ if the carry-and-sum pair $(c_{j+1}, s_j)$ generated by the jth full adder of the ripple adder represents the integer r, $0 \le r \le 3$. The output produced is the sum bit.




Input Encoding

$\sigma \in \Sigma$    u   v
(0, 0)    0   0
(0, 1)    0   1
(1, 0)    1   0
(1, 1)    1   1

Output Encoding

$\lambda(q) \in \Psi$    $\lambda(q)$
0              0
1              1

State Encoding

q       c   s
$q_0$     0   0
$q_1$     0   1
$q_2$     1   0
$q_3$     1   1

$\delta : \mathcal{B}^4 \mapsto \mathcal{B}^2$

c   s   u   v   c*  s*
0   0   0   0   0   0
0   1   0   0   0   0
1   0   0   0   0   1
1   1   0   0   0   1
0   0   0   1   0   1
0   1   0   1   0   1
1   0   0   1   1   0
1   1   0   1   1   0
0   0   1   0   0   1
0   1   1   0   0   1
1   0   1   0   1   0
1   1   1   0   1   0
0   0   1   1   1   0
0   1   1   1   1   0
1   0   1   1   1   1
1   1   1   1   1   1

$\lambda : \mathcal{B}^2 \mapsto \mathcal{B}$

c*  s*  s
0   0   0
0   1   1
1   0   0
1   1   1

Figure 3.14 Encodings for inputs, outputs, states, and the next-state and output functions of the FSM adder.

Figure 3.15 A sequential circuit for the FSM that adds binary numbers.

3.3.1 Binary Memory Devices

It is useful to fix ideas about memory units by designing one (a latch) from logic gates. We use two latches to create a flip-flop, the standard binary storage device. A collection of clocked flip-flops is called a register. A clocked latch can be constructed from a few AND and NOT gates, as shown in Fig. 3.16(a). The NAND gates (they compute NOT of AND) labeled $g_3$ and $g_4$ form the heart of the latch. Consider the inputs to $g_3$ and $g_4$, the lines connected to the outputs of NAND gates $g_1$ and $g_2$. If one is set to 1 and the other reset to 0, after all signals settle down, p and p* will assume complementary values (one will have value 1 and the other will have value 0), regardless of their previous values. The gate with input 1 will assume output 0 and vice versa.

Now if the outputs of $g_1$ and $g_2$ are both set to 1 and the values previously assumed by p and p* are complementary, these values will be retained due to the feedback between $g_3$ and $g_4$, as the reader can verify. Since the outputs of $g_1$ and $g_2$ are both 1 when the clock input (CLK in Fig. 3.16) has value 0, the complementary outputs of $g_3$ and $g_4$ remain unchanged when the clock is low. Since the outputs of a latch provide inputs to the logic-circuit portion of a sequential circuit, it is important that the latch outputs remain constant when the clock is low.

When the clock input is 1, the outputs of $g_1$ and $g_2$ are $\overline{S}$ and $\overline{R}$, the Boolean complements of S and R. If S and R are complementary, as is true for this latch since $R = \overline{S}$, this device will store the value of S in p and its complement in p*. Thus, if S = 1, the latch is set to 1, whereas if R = 1 (and S = 0) it is reset to 0. This type of device is called a D-type latch. For this reason we change the name of the external input to this memory device from S to D.

Because the output of the D-type latch shown in Fig. 3.16(a) changes when the clock pulse is high, it cannot be used as a stable input to a logic circuit that feeds this or another such flip-flop. Adding another stage like the first but having the complementary value for the clock pulse, as shown in Fig. 3.16(b), causes the output of the second stage to change only while the clock pulse is low. The output of the first stage does change when the clock pulse is high to record the new value of the state. This is called a master-slave edge-triggered flip-flop. Other types of flip-flop are described in texts on computer architecture.
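The behavior described above can be explored with a small gate-level simulation. The wiring is an assumption reconstructed from the prose rather than copied from Fig. 3.16: $g_1$ = NAND(S, CLK), $g_2$ = NAND(R, CLK) with R the complement of D, $g_3$ = NAND($g_1$, p*) producing p, and $g_4$ = NAND($g_2$, p) producing p*. The feedback loop is iterated until the outputs stop changing. The names `latch_step` and `flip_flop_cycle` are hypothetical.

```python
def nand(a, b):
    """NAND of two bits."""
    return 1 - (a & b)


def latch_step(d, clk, p, p_star):
    """Settle a D-type latch built from four NAND gates; returns (p, p*)."""
    s, r = d, 1 - d
    g1 = nand(s, clk)
    g2 = nand(r, clk)
    for _ in range(4):  # a few passes suffice for this feedback loop
        new_p = nand(g1, p_star)
        new_p_star = nand(g2, new_p)
        if (new_p, new_p_star) == (p, p_star):
            break
        p, p_star = new_p, new_p_star
    return p, p_star


def flip_flop_cycle(d, master, slave):
    """One clock period of a master-slave flip-flop made of two latches.

    The slave latch receives the complement of the clock, so its output
    changes only while the clock is low.
    """
    master = latch_step(d, 1, *master)        # clock high: master follows D
    slave = latch_step(master[0], 0, *slave)  # slave holds its old value
    master = latch_step(d, 0, *master)        # clock low: master holds
    slave = latch_step(master[0], 1, *slave)  # slave copies the master
    return master, slave
```

With the clock low the latch retains either complementary pair, and a full clock period moves the value of D to the output of the second stage.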


Figure 3.16 (a) Design of a D-type latch from NAND gates. (b) A master-slave edge-triggered D-type flip-flop.




3.4  Random-Access  Machines 

The random-access machine (RAM) models the essential features of the traditional serial computer. The RAM is modeled by two synchronous interconnected FSMs, a central processing unit (CPU) and a random-access memory. (See Fig. 3.17.) The CPU has a small number of storage locations called registers whereas the random-access memory has a large number. All operations performed by the CPU are performed on data stored in its registers. This is done for efficiency; no increase in functionality is obtained by allowing operations on data stored in memory locations as well.

3.4.1  The  RAM  Architecture 

The CPU implements a fetch-and-execute cycle in which it alternately reads an instruction from a program stored in the random-access memory (the stored-program concept) and executes it. Instructions are read and executed from consecutive locations in the random-access memory unless a jump instruction is executed, in which case an instruction from a non-consecutive location is executed next.

A  CPU  typically  has  five  basic  kinds  of  instruction:  a)  arithmetic  and  logical  instructions  of 
the  kind  described  in  Sections  2.5.1,  2.7,  2.9,  and  2.10,  b)  memory  load  and  store  instructions 
for  moving  data  between  memory  locations  and  registers,  c)  jump  instructions  for  breaking 
out  of  the  current  program  sequence,  d)  input  and  output  (I/O)  instructions,  and  e)  a  halt 
instruction. 

The basic random-access memory has an output word (out_wrd) and three input words, an address (addr), a data word (in_wrd), and a command (cmd). The command specifies one of three actions, a) read from a memory location, b) write to a memory location, or c) do nothing. Reading from address addr deposits the value of the word at this location into out_wrd whereas writing to addr replaces the word at this address with the value of in_wrd.
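The interface just described is small enough to sketch as a class. The command names `"read"`, `"write"`, and `"noop"`, the constructor parameters `num_words` and `word_bits`, and the method name `step` are illustrative assumptions; only `out_wrd`, `in_wrd`, `addr`, and `cmd` come from the text.

```python
class RandomAccessMemory:
    """A random-access memory driven by (cmd, addr, in_wrd) on each step."""

    READ, WRITE, NOOP = "read", "write", "noop"  # hypothetical command names

    def __init__(self, num_words, word_bits):
        self.word_bits = word_bits
        self.words = [0] * num_words  # every location starts at zero
        self.out_wrd = 0

    def step(self, cmd, addr=0, in_wrd=0):
        """Perform one command and return the current output word."""
        if cmd == self.READ:
            self.out_wrd = self.words[addr]
        elif cmd == self.WRITE:
            # Keep only the low-order word_bits bits of the data word.
            self.words[addr] = in_wrd & ((1 << self.word_bits) - 1)
        elif cmd != self.NOOP:
            raise ValueError(f"unknown command: {cmd}")
        return self.out_wrd
```

A write followed by a read of the same address returns the stored word, and a do-nothing command leaves both the memory and the output word unchanged.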


Figure 3.17 The random-access machine has a central processing unit (CPU) and a random-access memory unit.


©John  E  Savage 




This  memory  is  called  random-access  because  the  time  to  access  a  word  is  the  same  for  all 
words.  The  Turing  machine  introduced  in  Section  3.7  has  a  tape  memory  in  which  the  time 
to  access  a  word  increases  with  its  distance  from  the  tape  head. 

The random-access memory in the model in Fig. 3.17 has $m = 2^\mu$ storage locations each containing a b-bit word, where $\mu$ and b are integers. Each word has a $\mu$-bit address and the addresses are consecutive starting at zero. The combination of this memory and the CPU described above is the bounded-memory RAM. When no limit is placed on the number and size of memory words, this combination defines the unbounded-memory RAM. We use the term RAM for these two machines when context unambiguously determines which is intended.

DESIGN OF A SIMPLE CPU The design of a simple CPU is given in Section 3.10. (See Fig. 3.31.) This CPU has eight registers: a program counter (PC), accumulator (AC), memory address register (MAR), memory data register (MDR), operation code (opcode) register (OPC), input register (INR), output register (OUTR), and halt register (HALT). Each operation that requires two operands, such as addition or vector AND, uses AC and MDR as sources for the operands and places the result in AC. Each operation with one operand, such as the NOT of a vector, uses AC as both source and destination for the result. PC contains the address of the next instruction to be executed. Unless a jump instruction is executed, PC is incremented on the execution of each instruction. If a jump instruction is executed, the value of PC is changed. Jumps occur in our simple CPU if AC is zero.

To fetch the next instruction, the CPU copies PC to MAR and then commands the random-access memory to read the word at the address in MAR. This word appears in MDR. The portion of this word containing the identity of the opcode is transferred to OPC. The CPU then inspects the value of OPC and performs the small local operations to execute the instruction represented by it. For example, to perform an addition it commands the arithmetic/logical unit (ALU) to combine the contents of MDR and AC in an adder circuit and deposit the result in AC. If the instruction is a load accumulator instruction (LDA), the CPU treats the bits other than opcode bits as address bits and moves them to the MAR. It then commands the random-access memory to deposit the word at this address in MDR, after which it moves the contents of MDR to AC. In Section 3.4.3 we illustrate programming in an assembly language, the language of a machine enhanced by mnemonics and labels. We further illustrate assembly-language programming in Section 3.10.4 for the instruction set of the machine designed in Section 3.10.
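The fetch-and-execute cycle can be illustrated with a toy interpreter. Everything about the instruction format here is an assumption made for the sketch and is not the instruction set of Section 3.10: each word holds an opcode in its high bits and an 8-bit address in its low bits, the opcode numbering is arbitrary, and ADD is assumed to load its operand from memory into MDR before adding. Only LDA, the register names, and the jump-if-AC-is-zero rule come from the text.

```python
LDA, STA, ADD, JZ, HLT = range(5)  # hypothetical opcode numbering
ADDR_BITS = 8                      # hypothetical: low 8 bits hold an address
ADDR_MASK = (1 << ADDR_BITS) - 1


def word(opcode, addr=0):
    """Pack an opcode and an address into one memory word."""
    return (opcode << ADDR_BITS) | addr


def run(memory, max_steps=1000):
    """Execute the program in `memory` (a list of words); return AC."""
    pc = ac = 0
    for _ in range(max_steps):
        # Fetch: PC -> MAR, memory read -> MDR, opcode bits -> OPC.
        mar = pc
        mdr = memory[mar]
        opc = mdr >> ADDR_BITS
        operand = mdr & ADDR_MASK
        pc += 1
        # Execute the instruction identified by OPC.
        if opc == LDA:
            mar = operand
            mdr = memory[mar]
            ac = mdr
        elif opc == ADD:
            mar = operand
            mdr = memory[mar]
            ac = ac + mdr
        elif opc == STA:
            mar = operand
            mdr = ac
            memory[mar] = mdr
        elif opc == JZ:
            if ac == 0:  # jumps occur only when AC is zero
                pc = operand
        elif opc == HLT:
            break
        else:
            raise ValueError(f"unknown opcode: {opc}")
    return ac
```

A four-word program that loads one location, adds another, stores the result, and halts exercises each phase of the cycle.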

3.4.2  The  Bounded-Memory  RAM  as  FSM 

As  this  discussion  illustrates,  the  CPU  and  the  random-access  memory  are  both  finite-state 
machines.  The  CPU  receives  input  from  the  random-access  memory  as  well  as  from  external 
sources.  Its  output  is  to  the  memory  and  the  output  port.  Its  state  is  determined  by  the 
contents  of  its  registers.  The  random-access  memory  receives  input  from  and  produces  output 
to  the  CPU.  Its  state  is  represented  by  an  ro-tuple  (wq,  W\, . . . ,  wm-\)  of  6-bit  words,  one 
per  memory  location,  as  well  as  by  the  values  of  iruwrd,  out-word,  and  addr.  We  say  that 
the  random-access  memory  has  a  storage  capacity  of  S  =  mb  bits.  The  RAM  has  input  and 
output  registers  (not  shown  in  Fig.  3.17)  through  which  it  reads  external  inputs  and  produces 
external  outputs. 

As  the  RAM  example  illustrates,  some  FSMs  are  programmable.  In  fact,  a  program  stored 
in  the  RAM  memory  selects  one  of  very  many  state  sequences  that  the  RAM  may  execute.  The 
number of states of a RAM can be very large; just the random-access memory alone has more
than $2^S$ states.

The  programmability  of  the  unbounded-memory  RAM  makes  it  universal  for  FSMs,  as 
we  show  in  Section  3.4.4.  Before  taking  up  this  subject,  we  pause  to  introduce  an  assembly- 
language  program  for  the  unbounded-memory  RAM.  This  model  will  play  a  role  in  Chapter  5. 

3.4.3  Unbounded-Memory  RAM  Programs 

We  now  introduce  assembly-language  programs  to  make  concrete  the  use  of  the  RAM.  An 
assembly language contains one instruction for each machine-level instruction of a CPU.
However, instead of bit patterns, it uses mnemonics for opcodes and labels as symbolic addresses.
Labels  are  used  in  jump  instructions. 

Figure  3.18  shows  a  simple  assembly  language.  It  implements  all  the  instructions  of  the 
CPU  defined  in  Section  3.10  and  vice  versa  if  the  CPU  has  a  sufficiently  long  word  length. 

Our new assembly language treats all memory locations as equivalent and calls them
registers. Thus, no distinction is made between the memory locations in the CPU and those
in  the  random-access  memory.  Such  a  distinction  is  made  on  real  machines  for  efficiency:  it 
is  much  quicker  to  access  registers  internal  to  a  CPU  than  memory  locations  in  an  external 
random-access  memory. 

Registers are used for data storage and contain integers. Register names are drawn from the
set {R_0, R_1, R_2, ...}. The address of register R_i is i. Thus, both the number of registers and
their size are potentially unlimited. All registers are initialized with the value zero, except
that registers used as input registers to a program are initialized to input values. Results of a
computation are placed in output registers. Such registers may also serve as input registers.
Each instruction may be given a label drawn from the set {N_0, N_1, N_2, ...}. Labels are used by
jump instructions, as explained below.


Instruction       Meaning

INC R_i           Increment the contents of R_i by 1.
DEC R_i           Decrement the contents of R_i by 1.
CLR R_i           Replace the contents of R_i with 0.
R_i <- R_j        Replace the contents of R_i with those of R_j.
JMP+ N_i          Jump to closest instruction above current one with label N_i.
JMP- N_i          Jump to closest instruction below current one with label N_i.
R_j JMP+ N_i      If R_j contains 0, jump to closest instruction above
                  current one with label N_i.
R_j JMP- N_i      If R_j contains 0, jump to closest instruction below
                  current one with label N_i.
CONTINUE          Continue to next instruction; halt if none.

Figure 3.18 The instructions in a simple assembly language.

The meaning of each instruction should be clear except possibly for the CONTINUE and
jump instructions. If the program reaches a CONTINUE statement other than the last CONTINUE, it
executes the following instruction. If it reaches the last CONTINUE statement, the program
halts.

The jump instructions R_j JMP+ N_i, R_j JMP- N_i, JMP+ N_i, and JMP- N_i cause a
break in the program sequence. Instead of executing the next instruction in sequence, they
cause jumps to instructions with labels N_i. In the first two cases these jumps occur only when
the content of register R_j is zero. In the last two cases, these jumps occur unconditionally.
The instructions with JMP+ (JMP-) cause a jump to the closest instruction with label N_i
above (below) the current instruction. The use of the suffixes + and - permits the insertion of
program fragments into an existing program without relabeling instructions.

A  RAM  program  is  a  finite  sequence  of  assembly  language  instructions  terminated  with 
CONTINUE.  A  valid  program  is  one  for  which  each  jump  is  to  an  existing  label.  A  halting 
program  is  one  that  halts. 

TWO  RAM  PROGRAMS  We  illustrate  this  assembly  language  with  the  two  simple  programs 
shown  in  Fig.  3.19.  The  first  adds  two  numbers  and  the  second  uses  the  first  to  square  a 
number. The heading of each program explains its operation. Registers R_0 and R_1 contain the
initial values on which the addition program operates. On each step it increments R_0 by 1 and
decrements R_1 by 1 until R_1 is 0. Thus, on completion, the value of R_0 is its original value
plus the value of R_1 and R_1 contains 0.

The squaring program uses the addition program. It makes three copies of the initial value
x of R_0 and stores them in R_1, R_2, and R_3. It also clears R_0. R_2 will be used to reset R_1 to x
after adding R_1 to R_0. R_3 is used as a counter and decremented x times, by which point x has
been added to zero x times in R_0; that is, $x^2$ is computed.


R_0 <- R_0 + R_1           Comments

N_0   R_1 JMP- N_1         End if R_1 = 0
      INC R_0              Increment R_0
      DEC R_1              Decrement R_1
      JMP+ N_0             Repeat
N_1   CONTINUE

R_0 <- R_0^2               Comments

      R_2 <- R_0           Copy R_0 (x) to R_2
      R_3 <- R_0           Copy R_0 (x) to R_3
      CLR R_0              Clear the contents of R_0
N_2   R_1 <- R_2           Copy R_2 (x) to R_1
N_0   R_1 JMP- N_1         R_0 <- R_0 + R_1 (the addition program)
      INC R_0
      DEC R_1
      JMP+ N_0
N_1   CONTINUE
      DEC R_3              Decrement R_3
      R_3 JMP- N_3         End when zero
      JMP+ N_2             Add x to R_0
N_3   CONTINUE

Figure 3.19 Two simple RAM programs. The first adds two integers stored initially in registers
R_0 and R_1, leaving the result in R_0. The second uses the first to square the contents of R_0, leaving
the result in R_0.
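
To make the semantics of Figs. 3.18 and 3.19 concrete, the following Python sketch interprets the assembly language and runs both programs. It is only an illustration under stated assumptions: registers hold nonnegative integers, DEC leaves a register containing 0 unchanged, and the tuple encoding and the opcode names MOV, JZ+ and JZ- (for R_i <- R_j, R_j JMP+ N_i and R_j JMP- N_i) are our own, not the text's.

```python
# Minimal interpreter for the assembly language of Fig. 3.18 (a sketch).
# Assumptions (not from the text): registers hold nonnegative integers and
# DEC leaves a register containing 0 unchanged. Opcode names MOV, JZ+, JZ-
# are hypothetical encodings of R_i <- R_j and the conditional jumps.
from collections import defaultdict

def run(program, registers=None, max_steps=100_000):
    """Execute a list of (label, op, args) triples; return the registers."""
    regs = defaultdict(int)              # every register starts at zero
    regs.update(registers or {})
    pc = 0
    for _ in range(max_steps):
        if pc >= len(program):           # moved past the last CONTINUE: halt
            return dict(regs)
        _, op, args = program[pc]
        target = None
        if op == "INC":
            regs[args[0]] += 1
        elif op == "DEC":
            regs[args[0]] = max(0, regs[args[0]] - 1)
        elif op == "CLR":
            regs[args[0]] = 0
        elif op == "MOV":                # R_i <- R_j
            regs[args[0]] = regs[args[1]]
        elif op in ("JMP+", "JMP-"):     # unconditional jumps
            target = args[0]
        elif op in ("JZ+", "JZ-"):       # jump only if the register holds 0
            if regs[args[0]] == 0:
                target = args[1]
        elif op != "CONTINUE":
            raise ValueError(f"unknown instruction {op}")
        if target is None:
            pc += 1
        else:
            # '+' searches upward (earlier lines), '-' searches downward
            if op.endswith("+"):
                search = range(pc - 1, -1, -1)
            else:
                search = range(pc + 1, len(program))
            pc = next(i for i in search if program[i][0] == target)
    raise RuntimeError("step limit exceeded")

ADD = [
    ("N0", "JZ-", ("R1", "N1")),     # end if R1 = 0
    (None, "INC", ("R0",)),
    (None, "DEC", ("R1",)),
    (None, "JMP+", ("N0",)),         # repeat
    ("N1", "CONTINUE", ()),
]

SQUARE = [
    (None, "MOV", ("R2", "R0")),     # copy x to R2
    (None, "MOV", ("R3", "R0")),     # copy x to R3
    (None, "CLR", ("R0",)),
    ("N2", "MOV", ("R1", "R2")),     # reset R1 to x
    *ADD,                            # R0 <- R0 + R1, inserted unchanged
    (None, "DEC", ("R3",)),
    (None, "JZ-", ("R3", "N3")),     # end when the counter reaches zero
    (None, "JMP+", ("N2",)),         # add x to R0 again
    ("N3", "CONTINUE", ()),
]

print(run(ADD, {"R0": 3, "R1": 4})["R0"])   # 7
print(run(SQUARE, {"R0": 5})["R0"])         # 25
```

Note that the addition program is spliced into the squaring program without relabeling, which is exactly what the relative jumps JMP+ and JMP- are meant to allow.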
As indicated above, with large enough words each of the above assembly-language
instructions can be realized with a few instructions from the instruction set of the CPU designed in
Section 3.10. It is also true that each of these CPU instructions can be implemented by a
fixed number of instructions in the above assembly language. That is, with sufficiently long
memory words in the CPU and random-access memory, the two languages allow the same
computations with about the same use of time and space.

However,  the  above  assembly-language  instructions  are  richer  than  is  absolutely  essential 
to perform all computations. In fact with just five assembly-language instructions, namely
INC, DEC, CONTINUE, R_j JMP+ N_i, and R_j JMP- N_i, all the other instructions can be
realized. (See Problem 3.21.)

3.4.4  Universality  of  the  Unbounded-Memory  RAM 

The  unbounded-memory  RAM  is  universal  in  two  senses.  First,  it  can  simulate  any  finite- 
state  machine  including  another  random-access  machine,  and  second,  it  can  execute  any  RAM 
program. 

DEFINITION 3.4.1 A machine M is universal for a class of machines C if every machine in C can
be simulated by M. (A stronger definition requiring that M also be in C is used in Section 3.8.)

We  now  show  that  the  RAM  is  universal  for  the  class  C  of  finite-state  machines.  We  show 
that in O(T) steps and with constant storage capacity S the RAM can simulate T steps of any
other  FSM.  Since  any  random-access  machine  that  uses  a  bounded  amount  of  memory  can  be 
described  by  a  logic  circuit  such  as  the  one  defined  in  Section  3.10,  it  can  also  be  simulated  by 
the  RAM. 

THEOREM 3.4.1 Every T-step FSM $M = (\Sigma, \Psi, Q, \delta, \lambda, s, F)$ computation can be simulated
by a RAM in O(T) steps with constant space. Thus, the RAM is universal for finite-state machines.

Proof We sketch a proof. Since an FSM is characterized completely by its next-state and
output functions, both of which are assumed to be encoded by binary functions, it suffices to
write a fixed-length RAM program to perform a state transition, generate output, and record
the FSM state in the RAM memory using the tabular descriptions of the next-state and
output functions. This program is then run repeatedly. The amount of memory necessary
for this simulation is finite and consists of the memory to store the program plus one state
(requiring at least $\log_2 |Q|$ bits). While the amount of storage and time to record and
compute these functions is constant (that is, independent of T), they can be exponential in
$\log_2 |Q|$ because the next-state and output functions can be complex binary functions.
(See Section 2.12.) Thus, the number of steps taken by the RAM per FSM state transition is
constant. ■
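
The idea behind the proof sketch can be illustrated in a few lines of Python: the next-state and output functions are stored as tables, and each FSM step costs one lookup in each. This is a sketch only; the function name simulate_fsm and the example machine (which reports the parity of the number of 1s read so far) are hypothetical and not taken from the text.

```python
# Table-driven simulation of an FSM, in the spirit of the proof sketch.
# The example machine (parity of the number of 1s read) is hypothetical.
def simulate_fsm(delta, lam, start, inputs):
    """Run one table lookup per FSM step; return the list of outputs."""
    state = start                        # the only simulator data that changes
    outputs = []                         # kept here only so we can inspect it
    for symbol in inputs:                # T iterations for a T-step computation
        state = delta[(state, symbol)]   # next-state function as a table
        outputs.append(lam[state])       # output function as a table
    return outputs

delta = {("even", 0): "even", ("even", 1): "odd",
         ("odd", 0): "odd",   ("odd", 1): "even"}
lam = {"even": 0, "odd": 1}

print(simulate_fsm(delta, lam, "even", [1, 0, 1, 1]))   # [1, 1, 0, 1]
```

The tables have a fixed size that depends on the FSM but not on the number of steps T, which mirrors the constant-space, O(T)-time claim of the theorem.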

The  second  notion  of  universality  is  captured  by  the  idea  that  the  RAM  can  execute  RAM 
programs.  We  discuss  two  execution  models  for  RAM  programs.  In  the  first,  a  RAM  program 
is  stored  in  a  private  memory  of  the  RAM,  say  in  the  CPU.  The  RAM  alternates  between 
reading  instructions  from  its  private  memory  and  executing  them.  In  this  case  the  registers 
described  in  Section  3.4.3  are  locations  in  the  random-access  memory.  The  program  counter 
either  advances  to  the  next  instruction  in  its  private  memory  or  jumps  to  a  new  location  as  a 
result  of  a  jump  instruction. 

In  the  second  model  (called  by  some  [10]  the  random-access  stored  program  machine 
(RASP)),  a  RAM  program  is  stored  in  the  random-access  memory  itself.  A  RAM  program 
can be translated to a RASP program by replacing the names of RAM registers by the names
of  random-access  memory  locations  not  used  for  storing  the  RAM  program.  The  execution 
of  a  RASP  program  directly  parallels  that  of  the  RAM  program;  that  is,  the  RASP  alternates 
between  reading  instructions  and  executing  them.  Since  we  do  not  consider  the  distinction 
between  RASP  and  RAM  significant,  we  call  them  both  the  RAM. 

3.5  Random-Access  Memory  Design 

In this section we model the random-access memory described in Section 3.4 as an FSM
$M_{\text{RMEM}}(\mu, b)$ that has $m = 2^\mu$ b-bit data words, $w_0, w_1, \ldots, w_{m-1}$, as well as an input
data word d (in_wrd), an input address a (addr), and an output data word z (out_wrd). (See
Fig. 3.20.) The state of this FSM is the concatenation of the contents of the data, input and
output words, input address, and the command word. We construct an efficient logic circuit
for its next-state and output functions.

To  simplify  the  design  of  the  FSM  Mrmem  we  use  the  following  encodings  of  the  three 
input  commands: 


Name     s_1   s_0

no-op    0     0
read     0     1
write    1     0

An input to $M_{\text{RMEM}}$ is a $(\mu + b + 2)$-bit binary tuple: two bits to represent a
command, $\mu$ bits to specify an address, and b bits to specify a data word. The output function
of $M_{\text{RMEM}}$, $\lambda_{\text{RMEM}}$, is a simple projection operator and is realized by a circuit without any
gates. Applied to the state vector, it produces the output word.

We now describe a circuit for $\delta_{\text{RMEM}}$, the next-state function of $M_{\text{RMEM}}$. Memory words
remain unchanged if either no-op or read commands are executed. In these cases the value
of the command bit $s_1$ is 0. One memory word changes if $s_1 = 1$, namely, the one whose
address is a. Thus, the memory words $w_0, w_1, \ldots, w_{m-1}$ change only when $s_1 = 1$. The
word that changes is determined by the $\mu$-bit address a supplied as part of the input. Let
$a_{\mu-1}, \ldots, a_1, a_0$ be the $\mu$ bits of a. Let these bits be supplied as inputs to a $\mu$-bit decoder
function $f_{\text{decode}}^{(\mu)}$ (see Section 2.5.4). Let $y_{m-1}, \ldots, y_1, y_0$ be the m outputs of a decoder
circuit. Then, the Boolean function $c_i = s_1 y_i$ (shown in Fig. 3.21(a)) is 1 exactly when
the input address a is the binary representation of the integer i and the FSM $M_{\text{RMEM}}$ is
commanded to write the word d at address a.

Figure 3.20 A random-access memory unit $M_{\text{RMEM}}$ that holds m b-bit words. Its inputs
consist of a command (cmd), an input word (in_wrd), and an address (addr). It has one output
word (out_wrd).

Let $w_0^*, w_1^*, \ldots, w_{m-1}^*$ be the new values for the memory words. Let $w_{i,j}^*$ and $w_{i,j}$ be the
jth components of $w_i^*$ and $w_i$, respectively. Then, for $0 \le i \le m-1$ and $0 \le j \le b-1$ we
write $w_{i,j}^*$ in terms of $w_{i,j}$ and the jth component $d_j$ of d as follows:

$$c_i = s_1 y_i$$

$$w_{i,j}^* = \overline{c}_i\, w_{i,j} \vee c_i\, d_j$$

Figures 3.21(a) and (b) show circuits described by these formulas. It follows that changes
to memory words can be realized by a circuit containing $C_{\Omega_0}(f_{\text{decode}}^{(\mu)})$ gates for the decoder,
m gates to compute all the terms $c_i$, $0 \le i \le m-1$, and 4mb gates to compute $w_{i,j}^*$, $0 \le i \le
m-1$, $0 \le j \le b-1$ (NOTs are counted). Combining this with Lemma 2.5.4, we have that
a circuit realizing this portion of the next-state function has at most $m(4b+2) + (2\mu-2)\sqrt{m}$
gates. The depth of this portion of the circuit is the depth of the decoder plus 4 because the
longest path between an input and an output $w_0^*, w_1^*, \ldots, w_{m-1}^*$ is through the decoder and
then through the gates that form $\overline{c}_i w_{i,j}$. This depth is at most $\lceil \log_2 \mu \rceil + 5$.

Figure 3.21 A circuit that realizes the next-state and output function of the random-access
memory. The circuit in (a) computes the next values $\{w_{i,j}^*\}$ for components of memory words,
whereas that in (b) computes components $\{z_j^*\}$ of the output word. The output $y_i \wedge w_{i,j}$ of (a)
is an input to (b).

The circuit description is complete after we give a circuit to compute the output word z.
The value of z changes only when $s_0 = 1$, that is, when a read command is issued. The jth
component of z, namely $z_j$, is replaced by the value of $w_{i,j}$, where i is the address specified by
the input a. Thus, the new value of $z_j$, $z_j^*$, can be represented by the following formula (see
the circuit of Fig. 3.21(b)):

$$z_j^* = \overline{s}_0\, z_j \vee s_0 \left( \bigvee_{k=0}^{m-1} y_k w_{k,j} \right) \quad \text{for } 0 \le j \le b-1$$

Here $\bigvee$ denotes the OR of the m terms $y_k w_{k,j}$, $m = 2^\mu$. It follows that for each value of
j this portion of the circuit can be realized with m two-input AND gates and m - 1 two-input
OR gates (to form $\bigvee$) plus four additional operations. Thus, it is realized by an additional
(2m + 3)b gates. The depth of this circuit is the depth of the decoder ($\lceil \log_2 \mu \rceil + 1$) plus
$\mu = \log_2 m$, the depth of a tree of m inputs to form $\bigvee$, plus three more levels. Thus, the
depth of the circuit to produce the output word is $\mu + \lceil \log_2 \mu \rceil + 4$.
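
The formulas for $c_i$, $w_{i,j}^*$ and $z_j^*$ can be checked with a small bit-level simulation. The Python sketch below evaluates them directly on lists of bits; the function names decode and rmem_step and the list-of-bits representation are illustrative choices of ours, and the decoder is modeled by its input-output behavior rather than by a gate-level circuit.

```python
# Bit-level sketch of the next-state and output formulas for the memory FSM.
# Function and variable names are illustrative, not from the text.
def decode(a, m):
    """Decoder outputs y_0..y_{m-1}: y_i = 1 exactly when i == a."""
    return [1 if i == a else 0 for i in range(m)]

def rmem_step(words, z, s1, s0, a, d):
    """Apply one command; words is a list of m bit-lists, z and d are bit-lists."""
    m, b = len(words), len(d)
    y = decode(a, m)
    c = [s1 & y[i] for i in range(m)]                        # c_i = s_1 y_i
    # w*_{i,j} = (NOT c_i) w_{i,j} OR c_i d_j
    new_words = [[((1 - c[i]) & words[i][j]) | (c[i] & d[j])
                  for j in range(b)] for i in range(m)]
    # OR over k of y_k w_{k,j}: the bit j of the addressed word
    selected = [max(y[k] & words[k][j] for k in range(m)) for j in range(b)]
    # z*_j = (NOT s_0) z_j OR s_0 (selected bit)
    new_z = [((1 - s0) & z[j]) | (s0 & selected[j]) for j in range(b)]
    return new_words, new_z

words = [[0, 0, 0] for _ in range(4)]                 # m = 4 words of b = 3 bits
z = [0, 0, 0]
words, z = rmem_step(words, z, 1, 0, 2, [1, 0, 1])    # write 101 at address 2
words, z = rmem_step(words, z, 0, 1, 2, [0, 0, 0])    # read address 2
print(words[2], z)                                    # [1, 0, 1] [1, 0, 1]
```

A no-op (s_1 = s_0 = 0) leaves both the memory words and the output word unchanged, as the formulas require.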

The size of the complete circuit for the next-state function is at most $m(6b+2) + (2\mu-2)\sqrt{m} + 3b$.
Its depth is at most $\mu + \lceil \log_2 \mu \rceil + 4$. We state these results as a lemma.

LEMMA 3.5.1 The next-state and output functions of the FSM $M_{\text{RMEM}}(\mu, b)$, $\delta_{\text{RMEM}}$ and
$\lambda_{\text{RMEM}}$, can be realized with the following size and depth bounds over the standard basis $\Omega_0$,
where S = mb is its storage capacity in bits:

$$C_{\Omega_0}(\delta_{\text{RMEM}}, \lambda_{\text{RMEM}}) \le m(6b+2) + (2\mu-2)\sqrt{m} + 3b = O(S)$$

$$D_{\Omega_0}(\delta_{\text{RMEM}}, \lambda_{\text{RMEM}}) \le \mu + \lceil \log_2 \mu \rceil + 4 = O(\log(S/b))$$


Random-access  memories  can  be  very  large,  so  large  that  their  equivalent  number  of  logic 
elements  (which  we  see  from  the  above  lemma  is  proportional  to  the  storage  capacity  of  the 
memory)  is  much  larger  than  the  tens  to  hundreds  of  thousands  of  logic  elements  in  the  CPUs 
to  which  they  are  attached. 


3.6  Computational  Inequalities  for  the  RAM 

We  now  state  computational  inequalities  that  apply  for  all  computations  on  the  bounded- 
memory  RAM.  Since  this  machine  consists  of  two  interconnected  synchronous  FSMs,  we 
invoke  the  inequalities  of  Theorem  3.1.3,  which  require  bounds  on  the  size  and  depth  of  the 
next-state  and  output  functions  for  the  CPU  and  the  random-access  memory. 

From  Section  3.10.6  we  see  that  size  and  depth  of  these  functions  for  the  CPU  grow  slowly 
in the word length b and number of memory words m. In Section 3.5 we designed an FSM
modeling an S-bit random-access memory and showed that the size and depth of its next-state
and  output  functions  are  proportional  to  S  and  log  S,  respectively.  Combining  these  results, 
we  obtain  the  following  computational  inequalities. 
THEOREM 3.6.1 Let f be a subfunction of $f_{\text{RAM}}^{(T,m,b)}$, the function computed by the m-word, b-bit
RAM with storage capacity S = mb in T steps. Then the following bounds hold simultaneously
over the standard basis $\Omega_0$ for logic circuits:

$$C_{\Omega_0}(f) = O(ST)$$

$$D_{\Omega_0}(f) = O(T \log S)$$

The discussion in Section 3.1.2 of computational inequalities for FSMs applies to this
theorem. In addition, this theorem demonstrates the importance of the space-time product, ST,
as well as the product T log S. While intuition may suggest that ST is a good measure of the
resources needed to solve a problem on the RAM, this theorem shows that it is a fundamental
quantity because it directly relates to another fundamental complexity measure, namely, the
size of the smallest circuit for a function f. Similar statements apply to the second inequality.

It  is  important  to  ask  how  tight  the  inequalities  given  above  are.  Since  they  are  both  derived 
from  the  inequalities  of  Theorem  3.1.1,  this  question  can  be  translated  into  a  question  about 
the  tightness  of  the  inequalities  of  this  theorem.  The  technique  given  in  Section  3.2  can  be 
used  to  tighten  the  second  inequality  of  Theorem  3.1.1  so  that  the  bounds  on  circuit  depth 
can  be  improved  to  logarithmic  in  T  without  sacrificing  the  linearity  of  the  bound  on  circuit 
size.  However,  the  coefficients  on  these  bounds  depend  on  the  number  of  states  and  can  be 
very  large. 

3.7  Turing  Machines 

The  Turing  machine  model  is  the  classical  model  introduced  by  Alan  Turing  in  his  famous 
1936 paper [338]. No other model of computation has been found that can compute
functions that a Turing machine cannot compute. The Turing machine is a canonical model of
computation  used  by  theoreticians  to  understand  the  limits  on  serial  computation,  a  topic 
that  is  explored  in  Chapter  5.  The  Turing  machine  also  serves  as  the  primary  vehicle  for  the 
classification  of  problems  by  their  use  of  space  and  time.  (See  Chapter  8.) 

The (deterministic) one-tape, bounded-memory Turing machine (TM) consists of two
interconnected FSMs, a control unit and a tape unit of potentially unlimited storage capacity.
(It is shown schematically in Fig. 3.22.) At each unit of time the control unit accepts input
from the tape unit and supplies output to it. The tape unit produces the value in the cell
under the head, a b-bit word, and accepts and writes a b-bit word to that cell. It also accepts
commands to move the head one cell to the left or right or not at all. The bounded-memory
tape unit is an array of m b-bit cells and has a storage capacity of S = mb bits. A formal
definition of the one-tape deterministic Turing machine is given below.

Figure 3.22 A bounded-memory one-tape Turing machine.

DEFINITION 3.7.1 A standard Turing machine (TM) is a six-tuple $M = (\Gamma, \beta, Q, \delta, s, h)$,
where $\Gamma$ is the tape alphabet not containing the blank symbol $\beta$, Q is the finite set of states,
$\delta : Q \times (\Gamma \cup \{\beta\}) \mapsto (Q \cup \{h\}) \times (\Gamma \cup \{\beta\}) \times \{L, N, R\}$ is the next-state function, s is
the initial state, and $h \notin Q$ is the accepting halt state. A TM cannot exit from h. If M is in
state q with letter a under the tape head and $\delta(q, a) = (q', a', C)$, its control unit enters state $q'$
and writes $a'$ in the cell under the head, and moves the head left (if possible), right, or not at all if
C is L, R, or N, respectively.

The TM M accepts the input string $w \in \Gamma^*$ (it contains no blanks) if, when started in
state s with w placed left-adjusted on its otherwise blank tape and the tape head at the leftmost
tape cell, the last state entered by M is h. If M has other halting states (states from which it does
not exit) these are rejecting states. Also, M may not halt on some inputs.

M  accepts  the  language  L(M)  consisting  of  all  strings  accepted  by  M.  If  a  Turing  machine 
halts  on  all  inputs,  we  say  that  it  recognizes  the  language  that  it  accepts.  For  simplicity,  we 
assume that when M halts during language acceptance it writes the letter 1 in its first tape cell if its
input string is accepted and 0 otherwise.

The  function  computed  by  a  Turing  machine  on  input  string  w  is  the  string  z  written 
leftmost  on  the  non-blank  portion  of  the  tape  after  halting.  The  function  computed  by  a  TM  is 
partial  if  the  TM  fails  to  halt  on  some  input  strings  and  complete  otherwise. 

Thus,  a  TM  performs  a  computation  on  input  string  w,  which  is  placed  left-adjusted  on 
its  tape  by  placing  its  head  over  the  leftmost  symbol  of  w  and  repeatedly  reading  the  symbol 
under  the  tape  head,  making  a  state  change  in  its  control  unit,  and  producing  a  new  symbol 
for  the  tape  cell  and  moving  the  head  left  or  right  by  one  cell  or  not  at  all.  The  head  does  not 
move  left  from  the  leftmost  tape  cell.  If  a  TM  is  used  for  language  acceptance,  it  accepts  w  by 
halting  in  the  accepting  state  h.  If  the  TM  is  used  for  computation,  the  result  of  a  computation 
on  input  w  is  the  string  z  that  remains  on  the  non-blank  portion  of  its  tape. 
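
The operation of a one-tape TM described above can be made concrete with a short simulator. The following Python sketch is only an illustration: the names run_tm and BLANK, the dictionary encoding of the next-state function, and the example machine (which accepts binary strings containing an even number of 1s) are our own assumptions. We also assume, for brevity, that a missing table entry stands for a rejecting halt.

```python
# A small simulator for the one-tape TM of Definition 3.7.1 (a sketch).
# The example machine and the names BLANK, run_tm are hypothetical.
BLANK = "_"

def run_tm(delta, start, halt, w, max_steps=10_000):
    """Return True if the TM enters the accepting halt state on input w."""
    tape = list(w) or [BLANK]           # input placed left-adjusted on the tape
    state, head = start, 0              # head starts at the leftmost cell
    for _ in range(max_steps):
        if state == halt:
            return True
        key = (state, tape[head])
        if key not in delta:            # assumption: no entry means rejecting halt
            return False
        state, tape[head], move = delta[key]
        if move == "R":
            head += 1
            if head == len(tape):       # extend the tape with blanks on demand
                tape.append(BLANK)
        elif move == "L":
            head = max(0, head - 1)     # no move left from the leftmost cell
    raise RuntimeError("step limit exceeded; the TM may not halt")

# delta maps (state, symbol) to (new state, symbol written, head move)
delta = {
    ("even", "0"): ("even", "0", "R"),
    ("even", "1"): ("odd", "1", "R"),
    ("odd", "0"): ("odd", "0", "R"),
    ("odd", "1"): ("even", "1", "R"),
    ("even", BLANK): ("h", BLANK, "N"),  # end of input with even parity: accept
}

print(run_tm(delta, "even", "h", "1001"))   # True
print(run_tm(delta, "even", "h", "111"))    # False
```

The step limit is a practical safeguard in the sketch; as noted above, a TM need not halt on every input.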

We  require  that  M  store  the  letter  1  or  0  in  its  first  tape  cell  when  halting  during  language 
acceptance to simplify the construction of a circuit simulating M in Section 3.9.1. This
requirement is not essential because the fact that M has halted in state h can be detected with a
simple  circuit. 

The  multi-tape  Turing  machine  is  a  generalization  of  this  model  that  has  multiple  tape 
units.  (These  models  and  limits  on  their  ability  to  solve  problems  are  examined  in  Chapter  5, 
where it is shown that the multi-tape TM is no more powerful than the one-tape TM.)
Although in practice a TM uses a bounded number of memory locations, the full power of TMs
is  realized  only  when  they  have  access  to  an  unbounded  number  of  tape  cells. 

Although  the  TM  is  much  more  limited  than  the  RAM  in  the  flexibility  with  which  it  can 
access  memory,  given  sufficient  time  and  storage  capacity  they  both  compute  exactly  the  same 
set  of  functions,  as  we  show  in  Section  3.8. 

A  very  important  class  of  languages  recognized  by  TMs  is  the  class  P  of  polynomial-time 
languages. 
DEFINITION 3.7.2 A language $L \subseteq \Gamma^*$ is in P if there is a Turing machine M with tape alphabet
$\Gamma$ and a polynomial p(n) such that, for every $w \in \Gamma^*$, a) M halts in p(|w|) steps and b) M
accepts w if and only if it is in L.

The  class  P  is  said  to  contain  all  the  “feasible”  languages  because  any  language  requiring 
more  than  a  polynomial  number  of  steps  for  its  recognition  is  thought  to  require  so  much  time 
for  long  strings  as  not  to  be  recognizable  in  practice. 

A  second  important  class  of  languages  is  NP,  the  languages  accepted  in  polynomial  time 
by  nondeterministic  Turing  machines.  To  define  this  class  we  introduce  the  nondeterministic 
Turing  machines. 

3.7.1 Nondeterministic Turing Machines

A  nondeterministic  Turing  machine  (NDTM)  is  identical  to  the  standard  TM  except  that 
its  control  unit  has  an  external  choice  input.  (See  Fig.  3.23.) 

DEFINITION 3.7.3 A non-deterministic Turing machine (NDTM) is the extension of the TM
model by the addition of a choice input to its control unit. Thus an NDTM is a seven-tuple
$M = (\Sigma, \Gamma, \beta, Q, \delta, s, h)$, where $\Sigma$ is the choice input alphabet, $\Gamma$ is the tape alphabet not
containing the blank symbol $\beta$, Q is the finite set of states, s is the initial state, and $h \notin Q$
is the accepting halt state. A TM cannot exit from h. When M is in state q with letter a under
the tape head, reading choice input c, its next-state function
$\delta : Q \times \Sigma \times (\Gamma \cup \{\beta\}) \mapsto ((Q \cup \{h\}) \times (\Gamma \cup \{\beta\}) \times \{L, R, N\}) \cup \{\bot\}$ has value
$\delta(q, c, a)$. If $\delta(q, c, a) = \bot$, there is no
successor to the current state with choice input c and tape symbol a. If $\delta(q, c, a) = (q', a', C)$, M's
control unit enters state $q'$, writes $a'$ in the cell under the head, and moves the head left (if possible),
right, or not at all if C is L, R, or N, respectively. The choice input selects possible transitions on
each time step.

An NDTM M reads one character of its choice input string $c \in \Sigma^*$ on each step. An
NDTM  M  accepts  string  w  if  there  is  some  choice  string  c  such  that  the  last  state  entered  by  M  is 
h  when  M  is  started  in  state  s  with  w  placed  left-adjusted  on  its  otherwise  blank  tape  and  the  tape 
head  at  the  leftmost  tape  cell.  We  assume  that  when  M  halts  during  language  acceptance  it  writes 
the  letter  1  in  its  first  tape  cell  if  its  input  string  is  accepted  and  0  otherwise. 

An NDTM M accepts the language $L(M) \subseteq \Gamma^*$ consisting of those strings w that it accepts.
Thus, if $w \notin L(M)$, there is no choice input for which M accepts w.

Note  that  the  choice  input  c  associated  with  acceptance  of  input  string  w  is  selected  with  full 
knowledge  of  w.  Also,  note  that  an  NDTM  does  not  accept  any  string  not  in  L(M);  that  is, 
for  no  choice  inputs  does  it  accept  such  a  string. 

The  NDTM  simplifies  the  characterization  of  languages.  It  is  used  in  Section  8.10  to 
characterize  the  class  NP  of  languages  accepted  in  nondeterministic  polynomial  time. 

DEFINITION 3.7.4 A language $L \subseteq \Gamma^*$ is in NP if there is a nondeterministic Turing machine
M and a polynomial p(n) such that M accepts L and for each $w \in L$ there is a choice input c
such that M on input w with this choice input halts in p(|w|) steps.

A choice input is said to “verify” membership of a string in a language. The particular
string provided by the choice agent is a verifier for the language. The languages in NP are thus
easy to verify: they can be verified in a polynomial number of steps by a choice input string of
polynomial length.

Figure 3.23 A nondeterministic Turing machine modeled as a deterministic one whose control
unit has an external choice input that disambiguates the value of its next state.

The  class  NP  contains  many  important  problems.  The  Traveling  Salesperson  Problem 
(TSP)  is  in  this  class.  TSP  is  a  set  of  strings  of  the  following  kind:  each  string  contains  an 
integer  n,  the  number  of  vertices  (cities)  in  an  undirected  graph  G,  as  well  as  distances  between 
every  pair  of  vertices  in  G,  expressed  as  integers,  and  an  integer  k  such  that  there  is  a  path  that 
visits  each  city  once,  returning  to  its  starting  point  (a  tour),  whose  length  is  at  most  k.  A 
verifier  for  TSP  is  an  ordering  of  the  vertices  such  that  the  total  distance  traveled  is  no  more 
than k. Since there are n! orderings of the n vertices and n! is approximately $\sqrt{2\pi n}\, n^n e^{-n}$, a
verifier can be found in a number of steps exponential in n; the actual verification itself can be
done in $O(n^2)$ steps. (See Problem 3.24.) NP also contains many other important languages,
in  particular,  languages  defining  important  combinatorial  problems. 
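
The gap between checking a verifier and finding one can be seen in a small Python sketch for TSP. It is only an illustration under our own assumptions: the instance is given as a symmetric distance matrix, vertices are numbered from 0, and the names tour_length, verify and find_verifier are hypothetical. Checking a proposed ordering takes time polynomial in n, whereas the search tries up to n! orderings.

```python
# Verifying a TSP tour versus searching for one (a sketch).
# Names and the matrix representation are illustrative assumptions.
from itertools import permutations

def tour_length(dist, order):
    """Length of the closed tour visiting vertices in the given order."""
    n = len(order)
    return sum(dist[order[i]][order[(i + 1) % n]] for i in range(n))

def verify(dist, k, order):
    """Check that order lists every vertex once and the tour length is at most k."""
    return sorted(order) == list(range(len(dist))) and tour_length(dist, order) <= k

def find_verifier(dist, k):
    """Exhaustive search over all orderings of the vertices."""
    for order in permutations(range(len(dist))):
        if verify(dist, k, order):
            return list(order)
    return None

dist = [[0, 1, 4, 2],
        [1, 0, 1, 5],
        [4, 1, 0, 1],
        [2, 5, 1, 0]]

print(verify(dist, 5, [0, 1, 2, 3]))   # True: 1 + 1 + 1 + 2 = 5
print(find_verifier(dist, 4))          # None: no tour of length at most 4
```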

While  it  is  obvious  that  P  is  a  subset  of  NP,  it  is  not  known  whether  they  are  the  same. 
Since  for  each  language  L  in  NP  there  is  a  polynomial  p  such  that  for  each  string  w  in  L 
there is a verifying choice input c of length p(|w|), a polynomial in the length of w, the
number of possible choice strings c to be considered in search of a verifying string is at most
an exponential in |w|. Thus, for every language in NP there is an exponential-time algorithm
to  recognize  it. 
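
The exhaustive search over choice strings can be sketched as follows. Here the NDTM is abstracted, as an assumption of ours, by a deterministic function accepts_with_choice(w, c) that reports whether the machine accepts w when given choice string c. The toy language used (binary encodings of composite numbers, with a divisor serving as the choice string) and all names are hypothetical illustrations.

```python
# Deterministic recognition by trying every choice string (a sketch).
# The toy verifier and all names are illustrative assumptions.
from itertools import product

def accepts_with_choice(w, c):
    """Toy verifier: w and c are binary strings; accept if c encodes a
    nontrivial divisor of the number encoded by w."""
    if not w or not c:
        return False
    n, d = int(w, 2), int(c, 2)
    return 1 < d < n and n % d == 0

def recognize(w, p=lambda n: n):
    """Try all binary choice strings of length at most p(|w|)."""
    for length in range(p(len(w)) + 1):
        for bits in product("01", repeat=length):
            if accepts_with_choice(w, "".join(bits)):
                return True
    return False

print(recognize("1111"))   # True: 15 = 3 * 5
print(recognize("1101"))   # False: 13 is prime
```

The number of choice strings tried is at most $2^{p(|w|)+1} - 1$, so the running time is exponential in |w| even though each individual check is fast.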

Despite decades of research, the question of whether P is equal to NP, denoted $P \stackrel{?}{=} NP$,
remains open. It is one of the great outstanding questions of computer science today. The
approach taken to this question is to identify NP-complete problems (see Section 8.10), the
hardest problems in NP, and then attempt to determine whether or not such problems
are in P. TSP is one of these NP-complete problems.


3.8  Universality  of  the  Turing  Machine 

We  show  the  existence  of  a  universal  Turing  machine  in  two  senses.  On  the  one  hand,  we  show 
that  there  is  a  Turing  machine  that  can  simulate  any  RAM  computation.  Since  every  Turing 
machine can be simulated by the RAM, the Turing machine simulating a RAM is universal for
the  set  of  all  Turing  machines. 

Also,  because  there  is  a  Turing  machine  that  can  simulate  any  RAM  computation,  every 
RAM  program  can  be  simulated  on  this  Turing  machine.  Since  it  is  not  hard  to  see  that  every 
Turing  machine  can  be  described  by  a  RAM  program  (see  Problem  3.29),  it  follows  that  the 
RAM  programs  are  exactly  the  programs  computed  by  Turing  machines.  Consequently,  the 
RAM  is  also  universal. 

The  following  theorem  demonstrates  that  RAM  computations  can  be  simulated  by  Turing- 
machine  computations  and  vice  versa  when  each  operates  with  bounded  memory.  Note  that 
all  halting  computations  are  bounded-memory  computations.  A  direct  proof  of  the  existence 
of  a  universal  Turing  machine  is  given  in  Section  5.5. 

THEOREM 3.8.1 Let $S = mb$ and $m \ge b$. Then for every m-word, b-bit Turing machine $M_{TM}$
(with storage capacity S) there is an $O(m)$-word, b-bit RAM that simulates a time T computation
of $M_{TM}$ in time $O(T)$ and storage $O(S)$. Similarly, for every m-word, b-bit RAM $M_{RAM}$
there is an $O((m/b) \log m)$-word, $O(b)$-bit Turing machine that simulates a T-time, S-storage
computation of $M_{RAM}$ in time $O(ST \log^2 S)$ and storage $O(S \log S)$.

Proof We begin by describing a RAM that simulates a TM. Consider a b-bit RAM program
to simulate an m-word, b-bit TM. As shown in Theorem 3.4.1, a RAM program can be
written to simulate one step of an FSM. Since a TM control unit is an FSM, it suffices to
exhibit a RAM program to simulate a tape unit (also an FSM); this is straightforward, as
is combining the two programs. If the RAM has storage capacity proportional to that of
the TM, then the RAM need only record with one additional word the position of the tape
head. This word, which can be held in a RAM register, is incremented or decremented as
the head moves. The resulting program runs in time proportional to the running time of
the TM.
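The idea of this half of the proof can be sketched in Python, with a list standing in for the RAM's memory words and an integer standing in for the register that records the head position. The function name `simulate_tm`, the dictionary encoding of the control unit, and the choice to clamp the head at the ends of the bounded tape are assumptions made for this sketch, not details from the text.

```python
def simulate_tm(delta, tape, state, halt_states, max_steps):
    """Simulate a bounded-tape, one-tape TM.

    delta maps (state, symbol) -> (new_state, new_symbol, move),
    where move is -1 (left), 0 (stay), or +1 (right).
    """
    tape = list(tape)   # RAM memory words holding the m tape cells
    head = 0            # the one additional word: the head position
    steps = 0
    while state not in halt_states and steps < max_steps:
        # One step of the control-unit FSM: a constant number of RAM operations.
        state, tape[head], move = delta[(state, tape[head])]
        # Increment or decrement the head register, staying on the bounded tape.
        head = min(max(head + move, 0), len(tape) - 1)
        steps += 1
    return state, tape, steps


if __name__ == "__main__":
    # A toy machine that complements bits until it reaches the blank "B".
    delta = {
        ("q", "0"): ("q", "1", +1),
        ("q", "1"): ("q", "0", +1),
        ("q", "B"): ("halt", "B", 0),
    }
    print(simulate_tm(delta, "0110B", "q", {"halt"}, 100))
```

Each pass through the loop does a bounded amount of work, which is why the simulation time is proportional to the running time of the TM.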

We now describe a $b^*$-bit TM that simulates a RAM, where $b^* = \lceil \log m \rceil + b + c$
for some constant c, an assumption we examine later. Let RAM words and their corresponding
addresses be placed in individual cells on the tape of the TM, as suggested in Fig. 3.24. Let
the address addr of the RAM CPU program counter be placed on the tape of the TM to the
left, as suggested by the shading in the figure. (It is usually assumed that, unlike the RAM,
the TM holds words of size no larger than $O(b)$ in its control unit.) The TM simulates
a RAM by simulating the RAM fetch-and-execute cycle. This means it fetches a word at
address addr in the simulated RAM memory unit, interprets it as an instruction, and then
executes the instruction (which might require a few additional accesses to the memory unit
to read or write data). We return to the simulation of the RAM CPU after we examine the
simulation of the RAM memory unit.

Figure 3.24 Organization of a tape unit to simulate a RAM. Each RAM memory word $w_j$ is
accompanied by its address j in binary.

The TM can find a word at location addr as follows. It reads the most significant bit
of addr and moves right on its tape until it finds the first word with this most significant
bit. It leaves a marker at this location. (The symbol ◇ in Fig. 3.24 identifies the first place
a marker is left.) It then returns to the left-hand end of the tape and obtains the next most
significant bit of addr. It moves back to the marker ◇ and then carries this marker forward
to the next address containing the next most significant bit (identified by a second marker
symbol in Fig. 3.24). This process is repeated until all bits of addr have been visited, at which
point the word at location addr in the simulated RAM is found. Since m tape unit cells are used
in this simulation, at most $O(m \log m)$ TM steps are taken for this purpose.

The TM must also simulate internal RAM CPU computations. Each addition, subtraction,
and comparison of b-bit words can be done by the TM control unit in a constant
number of steps, as can the logical vector operations. (For simplicity, we assume that the
RAM does not use its I/O registers. To simulate these operations, either other tapes would
be used or space would be reserved on the single tape to hold input and output words.) The
jump instructions as well as the incrementing of the program counter require moving and
incrementing $\lceil \log m \rceil$-bit addresses. These cannot be simulated by the TM control unit
in a constant number of steps since it can only operate on b-bit words. Instead, they are
simulated on the tape by moving addresses in b-bit blocks. If two tape cells are separated
by $q - 1$ cells, $2q$ steps are necessary to move each block of b bits from the first cell to the
second. Thus, a full address can be moved in $2q \lceil \lceil \log m \rceil / b \rceil$ steps. An address can also
be incremented using ripple addition in $\lceil \lceil \log m \rceil / b \rceil$ steps using operations on b-bit words,
since the blocks of an address are contiguous. (See Section 2.7 for a discussion of ripple
addition.) Thus, both of these address-manipulation operations can be done in at most
$O(m \lceil \lceil \log m \rceil / b \rceil)$ steps, since no two words are separated by more than $O(m)$ cells.

Now consider the general case of a TM with word size comparable to that of the RAM,
that is, a size too small to hold an address as well as a word. In particular, consider a TM with
$\hat{b}$-bit tape alphabet where $\hat{b} = cb$, $c > 1$ a constant. In this case, we divide addresses into
$\lceil \lceil \log m \rceil / b \rceil$ b-bit words and place these words in locations that precede the value of the
RAM word at this address, as suggested in Fig. 3.40. We also place the address addr at the
beginning of the tape in the same number of tape words. A total of $O((m/b)(\log m))$ $O(b)$-bit
words are used to store all this data. Now assume that the TM can carry the contents of
a $\hat{b}$-bit word in its control unit. Then, as shown in Problem 3.26, the extra symbols in the
TM’s tape alphabet can be used as markers to find a word with a given address in at most
$O((m/b)(\log^2 m))$ TM steps using storage $O((m/b) \log m)$. Hence each RAM memory
access translates into $O((m/b)(\log^2 m))$ TM steps on this machine.


Simulation of the CPU on this machine is straightforward. Again, each addition, subtraction,
comparison, and logical vector operation on b-bit words can be done in a constant
number of steps. Incrementing of the program counter can also be done in
$\lceil \lceil \log m \rceil / b \rceil$ operations since the cells containing this address are contiguous.
However, since a jump operation may require moving an address by $O(m)$ cells in the $b^*$-bit TM,
it may now require moving it by $O(m (\log m)/b)$ cells in the $\hat{b}$-bit TM in
$O(m ((\log m)/b)^2)$ steps.

Combining these results, we see that each step of the RAM may require as many as
$O(m ((\log m)/b)^2)$ steps of the $\hat{b}$-bit TM. This machine uses storage $O((m/b) \log m)$.
Since $m = S/b$, the conclusion of the theorem follows. ■

This  simulation  of  a  bounded-memory  RAM  by  a  Turing  machine  assumes  that  the  RAM 
has  a  fixed  number  of  memory  words.  Although  this  may  appear  to  prevent  an  unbounded- 
memory  TM  from  simulating  an  unbounded-memory  RAM,  this  is  not  the  case.  If  the  Turing 
machine  detects  that  an  address  contains  more  than  the  number  of  bits  currently  assumed 
as  the  maximum  number,  it  can  increase  by  1  the  number  of  bits  allocated  to  each  memory 
location  and  then  resume  computation.  To  make  this  adjustment,  it  will  have  to  space  out  the 
memory  words  and  addresses  to  make  space  for  the  extra  bits.  (See  Problem  3.28.) 

Because a Turing machine with no limit on the length of its tape can be simulated by a
RAM, this last observation demonstrates the existence of universal Turing machines, Turing
machines with unbounded memory (but with fixed-size control units and bounded-size
tape alphabets) that can simulate arbitrary Turing machines. This matter is also treated in
Section 5.5.

Since  the  RAM  can  execute  RAM  programs,  the  same  is  true  of  the  Turing  machines.  As 
mentioned  above,  it  is  not  hard  to  see  that  every  Turing  machine  can  be  simulated  by  a  RAM 
program.  (See  Problem  3.29.)  As  a  consequence,  the  RAM  programs  are  exactly  the  programs 
that  can  be  computed  by  a  Turing  machine. 

While  the  above  remarks  apply  to  the  one-tape  Turing  machine,  they  also  apply  to  all  other 
Turing  machine  models,  such  as  double-ended  and  multi-tape  Turing  machines,  because  each 
of  these  can  also  be  simulated  by  the  one-tape  Turing  machine.  (See  Section  5.2.) 

3.9  Turing  Machine  Circuit  Simulations 

Just  as  every  T-step  finite-state  machine  computation  can  be  simulated  by  a  circuit,  so  can 
every  T-step  Turing  machine  computation.  We  give  two  circuit  simulations,  a  simple  one  that 
demonstrates  the  concept  and  another  more  complex  one  that  yields  a  smaller  circuit.  We  use 
these  two  simulations  in  Sections  3.9.5  and  3.9.6  to  establish  computational  inequalities  that 
must  hold  for  Turing  machines.  With  a  different  interpretation  they  provide  examples  of  P- 
complete  and  NP-complete  problems.  (See  also  Sections  8.9  and  8.10.)  These  results  illustrate 
the  central  role  of  circuits  in  theoretical  computer  science. 

3.9.1  A  Simple  Circuit  Simulation  of  TM  Computations 

We  now  design  a  circuit  simulating  a  computation  of  a  Turing  machine  M  that  uses  m  memory 
cells  and  T  steps.  Since  the  only  difference  between  a  deterministic  and  nondeterministic 
Turing  machine  is  the  addition  of  a  choice  input  to  the  control  unit,  we  design  a  circuit  for  a 
nondeterministic  Turing  machine. 

For  deterministic  computations,  the  circuit  simulation  provides  computational  inequalities 
that  must  be  satisfied  by  computational  resources,  such  as  space  and  time,  if  a  problem  is  to  be 
solved  by  M.  Such  an  inequality  is  stated  at  the  end  of  this  section. 

With the proper interpretation, the circuit simulation of a deterministic computation is an
instance of a P-complete problem, one of the hardest problems in P to parallelize. Here P is
the class of polynomial-time languages. A first P-complete problem is stated in the following
section. This topic is studied in detail in Section 8.9.

For nondeterministic computations, the circuit simulation produces an instance of an NP-complete
problem, a hardest problem to solve in NP. Here NP is the class of languages accepted
in polynomial time by a nondeterministic Turing machine. A first NP-complete problem is
stated in the following section. This topic is studied in detail in Section 8.10.

THEOREM 3.9.1 Any computation performed by a one-tape Turing machine M, deterministic or
nondeterministic, on an input string w in T steps using m b-bit memory cells can be simulated
by a circuit $C_{M,T}$ over the standard complete basis $\Omega$ of size and depth $O(ST)$ and
$O(T \log S)$, respectively, where $S = mb$ is the storage capacity in bits of M’s tape. For the
deterministic TM the inputs to this circuit consist of the values of w. For the nondeterministic
TM the inputs consist of w and the Boolean choice input variables whose values are not set in
advance.

Proof To construct a circuit $C_{M,T}$ simulating T steps by M is straightforward because M
is a finite-state machine now that its storage capacity is limited. We need only extend the
construction of Section 3.1.1 and construct a circuit for the next-state and output functions
of M. As shown in Fig. 3.25, it is convenient to view M as a pair of synchronous FSMs
(see Section 3.1.4) and design separate circuits for M’s control and tape units. The design
of the circuit for the control unit is straightforward since it is an unspecified NFSM. The
tape circuit, which realizes the next-state and output functions for the tape unit, contains
m cell circuits, one for each cell on the tape. We denote by $C_t(m)$, $1 \le t \le T$, the tth tape
circuit. We begin by constructing a tape circuit and determining its size and depth.

Figure 3.25 The circuit $C_{M,T}$ simulates an m-cell, T-step computation by a nondeterministic
Turing machine M. It contains T copies of M’s control unit circuit and T column circuits, $C_t$,
each containing cell circuits $C_{j,t}$, $0 \le j \le m-1$, $1 \le t \le T$, simulating the jth tape cell on the
tth time step. $q_t$ and $c_t$ are M’s state on the tth step and its tth set of choice variables. Also,
$a_{j,t}$ is the value in the jth cell on the tth step, $s_{j,t}$ is 1 if the head is over cell j at the tth
time step, and $v_{j,t}$ is $a_{j,t}$ if $s_{j,t} = 1$ and 0 otherwise. $v_t$, the vector OR of $v_{j,t}$,
$0 \le j \le m-1$, supplies the value under the head to the control unit, which computes head
movement commands, $h_t$, and a new word, $w_t$, for the current cell in the next simulated time
step. The value of the function computed by M resides on its tape after the Tth step.

For $0 \le j \le m-1$ and $1 \le t \le T$ let $C_{j,t}$ be the jth cell circuit of the tth tape circuit,
$C_t(m)$. $C_{j,t}$ produces the value $a_{j,t}$ contained in the jth cell after the tth step as well as
$s_{j,t}$, whose value is 1 if the head is over the jth tape cell after the tth step and 0 otherwise.
The value of $a_{j,t}$ is either $a_{j,t-1}$ if $s_{j,t-1} = 0$ (the head is not over this cell) or w if
$s_{j,t-1} = 1$ (the head is over the cell). Subcircuit $SC_2$ of Fig. 3.26 performs this computation.

Subcircuit $SC_1$ in Fig. 3.26 computes $s_{j,t}$ from $s_{j-1,t-1}$, $s_{j,t-1}$, $s_{j+1,t-1}$ and the
triple $h_t = (h_t^{-1}, h_t^{0}, h_t^{+1})$, where $h_t^{-1} = 1$ if the head moves to the next
lower-numbered cell, $h_t^{+1} = 1$ if it moves to the next higher-numbered cell, or $h_t^{0} = 1$ if it
does not move. Thus, $s_{j,t} = 1$ if $s_{j+1,t-1} = 1$ and $h_t^{-1} = 1$, or if $s_{j-1,t-1} = 1$ and
$h_t^{+1} = 1$, or if $s_{j,t-1} = 1$ and $h_t^{0} = 1$. Otherwise, $s_{j,t} = 0$.
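The update rule for the head-location bits can be checked with a short Python sketch. The rule sets the new bit for cell j to 1 when the head was over cell j+1 and moves to the lower-numbered cell, was over cell j−1 and moves to the higher-numbered cell, or was over cell j and does not move. The function name `next_head_bits` and the convention that a missing neighbor at either end of the tape contributes 0 are assumptions made for illustration.

```python
def next_head_bits(s_prev, h):
    """Compute the head-location bits for step t from those of step t-1.

    s_prev[j] is 1 iff the head was over cell j on the previous step.
    h = (h_minus, h_zero, h_plus) is the one-hot head movement command.
    """
    h_minus, h_zero, h_plus = h
    m = len(s_prev)
    s_next = []
    for j in range(m):
        right_neighbor = s_prev[j + 1] if j + 1 < m else 0   # assumed 0 off the tape
        left_neighbor = s_prev[j - 1] if j - 1 >= 0 else 0   # assumed 0 off the tape
        bit = ((right_neighbor & h_minus)   # head arrives from cell j+1
               | (left_neighbor & h_plus)   # head arrives from cell j-1
               | (s_prev[j] & h_zero))      # head stays on cell j
        s_next.append(bit)
    return s_next


if __name__ == "__main__":
    print(next_head_bits([0, 1, 0, 0], (0, 0, 1)))  # move to higher-numbered cell
    print(next_head_bits([0, 1, 0, 0], (1, 0, 0)))  # move to lower-numbered cell
```

Each output bit uses a fixed number of AND and OR operations, which is consistent with a cell circuit of constant depth.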

Subcircuit $SC_3$ of cell circuit $C_{j,t}$ generates the b-bit word $v_{j,t}$ that is used to provide
the value under the head on the tth step. $v_{j,t}$ is $a_{j,t}$ if the head is over the jth cell on the
tth step ($s_{j,t} = 1$) and 0 otherwise. The vector-OR of $v_{j,t}$, $0 \le j \le m-1$, is formed using
the tree circuit shown in Fig. 3.25 to compute the value of the b-bit word $v_t$ under the head
after the tth step. (This can be done by b balanced binary OR trees, each with size $m - 1$
and depth $\lceil \log_2 m \rceil$.) $v_t$ is supplied to the tth copy of the control unit circuit, which also
uses the previous state of the control unit, $q_t$, and the choice input $c_t$ (a tuple of Boolean
variables) to compute the next state, $q_{t+1}$, the new b-bit word $w_{t+1}$ for the current tape cell,
and the head movement command $h_{t+1}$.

Figure 3.26 The cell circuit $C_{j,t}$ has three components: $SC_1$, a circuit to compute the new
value for the head location bit $s_{j,t}$ from the values of this quantity on the preceding step at
neighboring cells and the head movement vector $h_t$, $SC_2$, a circuit to replace the value in the jth
cell on the tth step with the input w if the head is over the cell on the $(t-1)$st step
($s_{j,t-1} = 1$), and $SC_3$, a circuit to produce the new value in the jth cell at the tth step if the
head is over this cell ($s_{j,t} = 1$) and the zero vector otherwise. The circuit $C_{j,t}$ has
$5(b + 1)$ gates and depth 4.

Summarizing, it follows that the tth tape circuit, $C_t(m)$, uses $O(S)$ gates (here $S = mb$)
and has depth $O(\log(S/b))$.
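The size and depth claimed for each balanced binary OR tree can be confirmed with a small Python sketch. The function name `or_tree` and its return convention are assumptions made here; the sketch builds the tree level by level for one bit position and counts the gates and levels it uses. The vector OR over b-bit words would use b such trees side by side.

```python
def or_tree(bits):
    """Evaluate a balanced binary OR tree over a non-empty list of bits.

    Returns (value, gates, depth), where gates counts two-input OR gates.
    """
    level = list(bits)
    gates = 0
    depth = 0
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            nxt.append(level[i] | level[i + 1])   # one two-input OR gate
            gates += 1
        if len(level) % 2 == 1:
            nxt.append(level[-1])                 # odd element passes up unchanged
        level = nxt
        depth += 1
    return level[0], gates, depth


if __name__ == "__main__":
    for m in (1, 2, 5, 8, 9):
        bits = [0] * m
        bits[-1] = 1
        print(m, or_tree(bits))
```

For m inputs the gate count is always m − 1, because each gate reduces the number of live values by one, and the number of levels is $\lceil \log_2 m \rceil$.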

Let $C_{\text{control}}$ and $D_{\text{control}}$ be the size and depth of the circuit simulating the
control unit. It follows that the circuit simulating T computation steps by a Turing machine M has
$T\,C_{\text{control}}$ gates in the T copies of the control unit and $O(ST)$ gates in the tape circuits
for a total of $O(ST)$ gates. Since the longest path through the circuit of Fig. 3.26 passes through
each control and tape circuit, the depth of this circuit is
$O(T(D_{\text{control}} + \log(S/b))) = O(T \log S)$.

The simulation of M is completed by placing the head over the zeroth cell by letting
$s_{0,0} = 1$ and $s_{j,0} = 0$ for $j \ne 0$. The inputs to M are fixed by setting $a_{j,0} = w_j$ for
$0 \le j \le n-1$ and to the blank symbol for $j \ge n$. Finally, $v_0$ is set equal to $a_{0,0}$, the
value under the head at the start of the computation. The choice inputs are sets of Boolean
variables under the control of an outside agent and are treated as variables of the circuit
simulating the Turing machine M. ■

We  now  give  two  interpretations  of  the  above  simulation.  The  first  establishes  that  the 
circuit  complexity  for  a  function  provides  a  lower  bound  to  the  time  required  by  a  computation 
on  a  Turing  machine.  The  second  provides  instances  of  problems  that  are  P-complete  and  NP- 
complete. 


3.9.2  Computational  Inequalities  for  Turing  Machines 

When the simulation of Theorem 3.9.1 is specialized to a deterministic Turing machine M, a
circuit is constructed that computes the function f computed by M in T steps with S bits of
memory. It follows that $C_\Omega(f)$ and $D_\Omega(f)$ cannot be larger than those given in this theorem,
since this circuit also computes f. From this observation we have the following computational
inequalities.

THEOREM 3.9.2 The function f computed by an m-word, b-bit one-tape Turing machine in T
steps can also be computed by a circuit whose size and depth satisfy the following bounds over any
complete basis $\Omega$, where $S = mb$ is the storage capacity used by this machine:

$C_\Omega(f) = O(ST)$

$D_\Omega(f) = O(T \log S)$

Since $S = O(T)$ (at most $T + 1$ cells can be visited in T steps), we have the following
corollary. It demonstrates that the time T to compute a function f with a Turing machine is
at least the square root of its circuit size. As a consequence, circuit size complexity can be used
to derive lower bounds on computation time on Turing machines.

COROLLARY 3.9.1 Let the function f be computed by an m-word, b-bit one-tape Turing machine
in T steps, b fixed. Then, over any complete basis $\Omega$ the following inequality must hold:

$C_\Omega(f) = O(T^2)$
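The step from the size bound of Theorem 3.9.2 to the corollary can be written out as a short derivation. It assumes, as the corollary does, that b is a fixed constant and that the machine uses only the cells it can visit, so that $m \le T + 1$; the final line is the square-root lower bound on time, up to a constant factor.

```latex
\begin{align*}
S &= mb \;\le\; (T + 1)\,b \;=\; O(T), \\
C_\Omega(f) &= O(ST) \;=\; O(T^2), \\
T &= \Omega\!\left(\sqrt{C_\Omega(f)}\right).
\end{align*}
```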

There is no loss in assuming that a language L is a set of strings over a binary alphabet;
that is, $L \subseteq B^*$. As explained in Section 1.2.3, a language can be defined by a family
$\{f_1, f_2, f_3, \ldots\}$ of characteristic (Boolean) functions, $f_n : B^n \mapsto B$, where a string w of
length n is in L if and only if $f_n(w) = 1$.

Theorem 3.9.2 not only establishes a clear connection between Turing time complexity
and circuit size complexity, but it also provides a potential means to resolve the question
$P \stackrel{?}{=} NP$ of whether P and NP are equal or not. Circuit complexity is currently believed to
be the most promising tool to examine this question. (See Chapter 9.)

3.9.3  Reductions  from  Turing  to  Circuit  Computations 

As shown in Theorem 3.9.1, a circuit $C_{M,T}$ can be constructed that simulates a time- and
space-bounded computation by either a deterministic or a nondeterministic Turing machine
M. If M is deterministic and accepts the binary input string w, then $C_{M,T}$ has value 1 when
supplied with the value of w. If M is nondeterministic and accepts the binary input string w,
then for some values of the binary choice variables c, $C_{M,T}$ on inputs w and c has value 1.

The language of strings describing circuits with fixed inputs whose value on these inputs
is 1 is called CIRCUIT VALUE. When the circuits also have variable inputs whose values can
be chosen so that the circuits have value 1, the language of strings describing such circuits is
called CIRCUIT SAT. (See Section 3.9.6.) The languages CIRCUIT VALUE and CIRCUIT SAT
are examples of P-complete and NP-complete languages, respectively.

The P-complete and NP-complete languages play an important role in complexity theory:
they are prototypical hard languages. The P-complete languages can all be recognized in
polynomial time on serial machines, but it is not known how to recognize them on parallel
machines in time that is a polynomial in the logarithm of the length of strings (this is called
poly-logarithmic time), which should be possible if they are parallelizable. The NP-complete
languages can be recognized in exponential time on deterministic serial machines, but it is
not known how to recognize them in polynomial time on such machines. Many important
problems have been shown to be P-complete or NP-complete.

Because so much effort has been expended without success in trying to show that the
NP-complete (P-complete) languages can be solved serially (in parallel) in polynomial
(poly-logarithmic) time, it is generally believed they cannot. Thus, showing that a problem is
NP-complete (P-complete) is considered good evidence that a problem is hard to solve serially
(in parallel).

To obtain such results, we exhibit a program that writes the description of the circuit $C_{M,T}$
from a description of the TM M and the values written initially on its tape. The time and
space needed by this program are used to classify languages and, in particular, to identify the
P-complete and NP-complete languages.

The simple program $\mathcal{P}$ shown schematically in Fig. 3.27 writes a description of the circuit
$C_{M,T}$ of Fig. 3.25, which is deterministic or nondeterministic depending on the nature of
M. (Textual descriptions of circuits are given in Section 2.2. Also see Problem 3.8.) The
first loop of this program reads the value of the ith input letter $w_i$ of the string w written on
the input tape of $\mathcal{P}$, after which it writes a fragment of a straight-line program containing the
value of $w_i$. The second loop sets the remaining initial values of cells to the blank symbol β.
The third outer loop writes a straight-line program for the control unit using the procedure
WRITE_CONTROL_UNIT that has as arguments t, the index of the current time step, and $c_t$,
the tuple of Boolean choice input variables for the tth step. These choice variables are not used
if M is deterministic. In addition, this loop uses the procedure WRITE_OR to write a straight-line
program for the vector OR circuit that forms the contents $v_t$ of the cell under the head
after the tth step. Its inner loop uses the procedure WRITE_CELL_CIRCUIT with parameters j
and t to write a straight-line program for the jth cell circuit in the tth tape circuit.

    for i := 0 to n − 1
        READ_VALUE(w_i)
        WRITE_INPUT(i, w_i)
    for j := n to m − 1
        WRITE_INPUT(j, β)
    for t := 1 to T
        WRITE_CONTROL_UNIT(t, c_t)
        WRITE_OR(t, m)
        for j := 0 to m − 1
            WRITE_CELL_CIRCUIT(j, t)

Figure 3.27 A program $\mathcal{P}$ to write the description of a circuit $C_{M,T}$ that simulates T steps of a
nondeterministic Turing machine M and uses m memory words. It reads the n inputs supplied
to M, after which it writes the input steps of a straight-line program that reads these n inputs as
well as $m - n$ blanks β into the first copy of a tape unit. It then writes the remaining steps of a
straight-line program consisting of descriptions of the T copies of the control unit and the mT
cell circuits simulating the T copies of the tape unit.
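The loop structure of the program in Fig. 3.27 can be mirrored by a Python generator. This is a sketch under stated assumptions: the function name `write_circuit_description`, the default blank symbol "B", and the one-line-per-component output format are hypothetical stand-ins for the WRITE procedures, whose actual output (straight-line program fragments) is not reproduced here.

```python
def write_circuit_description(w, m, T, blank="B"):
    """Yield one placeholder line per component of the simulating circuit.

    w is the input string, m the number of tape cells, T the number of steps.
    The line formats below are illustrative, not the book's circuit syntax.
    """
    n = len(w)
    for i in range(n):                     # first loop: the n input letters
        yield f"INPUT a[{i},0] = {w[i]}"
    for j in range(n, m):                  # second loop: remaining cells are blank
        yield f"INPUT a[{j},0] = {blank}"
    for t in range(1, T + 1):              # third loop: one time step per pass
        yield f"CONTROL_UNIT t={t} choice=c[{t}]"
        yield f"VECTOR_OR t={t} fan_in={m}"
        for j in range(m):                 # inner loop: the m cell circuits
            yield f"CELL_CIRCUIT j={j} t={t}"


if __name__ == "__main__":
    for line in write_circuit_description("10", m=3, T=2):
        print(line)
```

Apart from its input and output, the generator keeps only the counters i, j, and t, each bounded by n, m, or T. It emits m + T(m + 2) lines in total.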

The program $\mathcal{P}$ given in Fig. 3.27 is economical in its use of space and time, as we show.
Consider a language L in P; that is, for L there is a deterministic Turing machine $M_L$ and a
polynomial $p(n)$ such that on an input string w of length n, $M_L$ halts in $T = p(n)$ steps.
It accepts w if it is in L and rejects it otherwise. Since $\mathcal{P}$ uses space logarithmic in the values
of n and T and $T = p(n)$, $\mathcal{P}$ uses space logarithmic in n. (For example, if $p(n) = n^6$,
$\log_2 p(n) = 6 \log_2 n = O(\log n)$.) Such programs are called log-space programs.

We show in Theorem 8.8.1 that the composition of two log-space programs is a log-space
program, a non-obvious result. However, it is straightforward to show that the composition of
two polynomial-time programs is a polynomial-time program. (See Problems 3.2 and 8.19.)
Since $\mathcal{P}$’s inner and outer loops each execute a polynomial number of steps, it follows that
$\mathcal{P}$ is a polynomial-time program.

If M is nondeterministic, $\mathcal{P}$ continues to be a log-space, polynomial-time program. The
only difference is that it writes a circuit description containing references to choice variables
whose values are not specified in advance. We state these observations in the form of a theorem.

THEOREM 3.9.3 Let $L \in$ P ($L \in$ NP). Then for each string $w \in B^*$ a deterministic
(nondeterministic) circuit $C_{M,T}$ can be constructed by a program in logarithmic space and
polynomial time in $n = |w|$, the length of w, such that the output of $C_{M,T}$, the value in the first
tape cell, is (can be) assigned value 1 (for some values of the choice inputs) if $w \in L$ and 0 if
$w \notin L$.

The program of Fig. 3.27 provides a translation (or reduction) from any language in NP
(or P) to a language that we later show is a hardest language in NP (or P).

We  now  use  Theorem  3.9.3  and  the  above  facts  to  give  a  brief  introduction  to  the  P- 
complete  and  NP-complete  languages,  which  are  discussed  in  more  detail  in  Chapter  8. 

3.9.4  Definitions  of  P-Complete  and  NP-Complete  Languages 

In this section we identify languages that are hardest in the classes P and NP. A language $L_0$ is
hardest in one of these classes if a) $L_0$ is itself in the class and b) for every language L in the
class, a test for the membership of a string w in L can be constructed by translating w with an
algorithm to a string v and testing for membership of v in $L_0$. If the class is P, the algorithm
must use at most space logarithmic in the length of w, whereas in the case of NP, the algorithm
must use time at most a polynomial in the length of w. Such a language $L_0$ is said to be a
complete language for this complexity class. We begin by defining the P-complete languages.

DEFINITION 3.9.1 A language $L \subseteq B^*$ is P-complete if it is in P and if for every language
$L_0 \subseteq B^*$ in P, there is a log-space deterministic program that translates each $w \in B^*$ into a
string $w' \in B^*$ such that $w \in L_0$ if and only if $w' \in L$.

The  NP-complete  languages  have  a  similar  definition.  However,  instead  of  requiring  that 
the  translation  be  log-space,  we  ask  only  that  it  be  polynomial-time.  It  is  not  known  whether 
all  polynomial-time  computations  can  be  done  in  logarithmic  space. 

DEFINITION 3.9.2 A language $L \subseteq B^*$ is NP-complete if it is in NP and if for every language
$L_0 \subseteq B^*$ in NP, there is a polynomial-time deterministic program that translates each
$w \in B^*$ into a string $w' \in B^*$ such that $w \in L_0$ if and only if $w' \in L$.

Space precludes our explaining the important role of the P-complete languages. We simply
report that these languages are the hardest languages to parallelize and refer the reader to
Sections 8.9 and 8.14.2. However, we do explain the importance of the NP-complete languages.

As the following theorem states, if an NP-complete language is in P, that is, if membership
of a string in an NP-complete language can be determined in polynomial time, then the same
can be done for every language in NP; that is, P and NP are the same class of languages.
Since decades of research have failed to show that P = NP, a determination that a problem is
NP-complete is a testimonial to but not a proof of its difficulty.

THEOREM  3.9.4  If  an  NP-complete  language  is  in  P,  then  P  =  NP. 

Proof Let $L$ be NP-complete and let $L_0$ be an arbitrary language in NP. Because $L$ is
NP-complete, there is a polynomial-time program that translates an arbitrary string $w$ into a
string $w'$ such that $w' \in L$ if and only if $w \in L_0$. Because the translation runs in time
polynomial in $|w|$, the length of $w'$ is also polynomial in $|w|$. If $L \in$ P, membership of
$w'$ in $L$ can therefore be tested in time polynomial in $|w|$. It follows that
there exists a polynomial-time program to determine membership of a string in $L_0$. Thus,
every language in NP is also in P. ■
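
The argument in this proof is a composition of two programs. The following Python sketch illustrates it under stated assumptions: `translate` and `decide_L` stand for the polynomial-time translation and the polynomial-time membership test, and the two toy languages are hypothetical examples chosen only to make the sketch runnable.

```python
from typing import Callable


def decide_via_reduction(translate: Callable[[str], str],
                         decide_L: Callable[[str], bool]) -> Callable[[str], bool]:
    """Build a membership test for L0 from a reduction L0 -> L and a test for L."""
    def decide_L0(w: str) -> bool:
        # w is in L0 if and only if translate(w) is in L.
        return decide_L(translate(w))
    return decide_L0


# Toy languages (assumptions for illustration only):
# L  = binary strings with an even number of 1s,
# L0 = binary strings with an even number of 0s.
def decide_L(w: str) -> bool:
    return w.count("1") % 2 == 0


def translate(w: str) -> str:
    # Complementing every bit maps L0 to L; it takes time linear in len(w).
    return "".join("1" if c == "0" else "0" for c in w)


decide_L0 = decide_via_reduction(translate, decide_L)
```

If the translation runs in time q(n) and the test for L in time p(n), the composed test runs in time at most q(n) + p(q(n)), which is again a polynomial.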

3.9.5  Reductions  to  P-Complete  Languages 

We now formally define CIRCUIT VALUE, our first P-complete language.

CIRCUIT VALUE

Instance: A circuit description with fixed values for its input variables and a designated
output gate.

Answer: “Yes” if the output of the circuit has value 1.

THEOREM 3.9.5 The language CIRCUIT VALUE is P-complete.

Proof  To  show  that  CIRCUIT  VALUE  is  P-complete,  we  must  show  that  it  is  in  P  and 
that  every  language  in  P  can  be  translated  to  it  by  a  log-space  program.  We  have  already 
shown  the  second  half  of  the  proof  in  Theorem  3.9.1.  We  need  only  show  the  first  half, 
which follows from a simple analysis of the obvious program. Since a circuit is a graph of a
straight-line program, each step depends on steps that precede it. (Such a program can be
produced by a depth-first traversal of the circuit starting with its output vertex in which each
vertex is listed after the vertices on which it depends, that is, in post-order.) Now scan
the  straight-line  program  and  evaluate  and  store  in  an  array  the  value  of  each  step.  Successive 
steps  access  this  array  to  find  their  arguments.  Thus,  one  pass  over  the  straight-line  program 
suffices  to  evaluate  it;  the  evaluating  program  runs  in  linear  time  in  the  length  of  the  circuit 
description.  Hence  CIRCUIT  VALUE  is  in  P.  ■ 
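
The one-pass evaluation described in this proof can be sketched in Python as follows. The tuple encoding of steps (`("READ", value)`, `("NOT", j)`, `("AND", j, k)`, `("OR", j, k)`) is an assumption made for this illustration and is not the circuit description format used in the text.

```python
def circuit_value(steps, output):
    """Evaluate a straight-line program in one pass and return the value of step `output`.

    Each step may refer only to steps with smaller indices.
    """
    values = []  # values[i] holds the value computed by step i
    for step in steps:
        op = step[0]
        if op == "READ":
            values.append(step[1])  # input variable with a fixed value
        elif op == "NOT":
            values.append(1 - values[step[1]])
        elif op == "AND":
            values.append(values[step[1]] & values[step[2]])
        elif op == "OR":
            values.append(values[step[1]] | values[step[2]])
        else:
            raise ValueError(f"unknown step type: {op}")
    return values[output]


# Exclusive OR of two inputs, built as (x0 OR x1) AND NOT (x0 AND x1).
def xor_circuit(x0, x1):
    return [("READ", x0), ("READ", x1), ("OR", 0, 1),
            ("AND", 0, 1), ("NOT", 3), ("AND", 2, 4)]


result = circuit_value(xor_circuit(1, 0), output=5)
```

Each step is visited once and its arguments are found by array lookup, which is the source of the linear running time claimed in the proof.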

When we wish to show that a new language $L_1$ is P-complete, we first show that it is in
P. Then we show that every language $L \in$ P can be translated to it in logarithmic space; that
is, for each string $w$, there is an algorithm that uses temporary space $O(\log |w|)$ (as does the
program in Fig. 3.27) that translates $w$ into a string $v$ such that $w$ is in $L$ if and only if $v$ is
in $L_1$. (This is called a log-space reduction. See Section 8.5 for a discussion of temporary
space.)

If we have already shown that a language $L_0$ is P-complete, we ask whether we can save
work by using this fact to show that another language, $L_1$, in P is P-complete. This is
possible because the composition of two deterministic log-space algorithms is another log-space
algorithm, as shown in Theorem 8.8.1. Thus, if we can translate $L_0$ into $L_1$ with a log-space
algorithm, then every language in P can be translated into $L_1$ by a log-space reduction. (This
idea is suggested in Fig. 3.28.) Hence, the task of showing $L_1$ to be P-complete is reduced
to showing that $L_1$ is in P and that $L_0$, which is P-complete, can be translated to $L_1$ by a
log-space algorithm. Many P-complete languages are exhibited in Section 8.9.


Figure 3.28 A language $L_0$ is shown P-complete by demonstrating that $L_0$ is in P and that
every language $L$ in P can be translated to it in logarithmic space. A new language $L_1$ is shown
P-complete by showing that it is in P and that $L_0$ can be translated to it in log-space. Since $L$ can
be $L_1$, $L_1$ can also be translated to $L_0$ in log-space.

3.9.6 Reductions to NP-Complete Languages

Our  first  NP-complete  language  is  CIRCUIT  SAT,  a  language  closely  related  to  CIRCUIT 
VALUE. 

CIRCUIT  SAT 

Instance: A circuit description with $n$ input variables $\{x_1, x_2, \ldots, x_n\}$ for some integer $n$
and a designated output gate.

Answer: “Yes” if there is an assignment of values to the variables such that the output of the
circuit has value 1.

THEOREM 3.9.6 The language CIRCUIT SAT is NP-complete.

Proof  To  show  that  CIRCUIT  SAT  is  NP-complete,  we  must  show  that  it  is  in  NP  and  that 
every  language  in  NP  can  be  translated  to  it  by  a  polynomial-time  program.  We  have  already 
shown  the  second  half  of  the  proof  in  Theorem  3.9.1.  We  need  only  show  the  first  half.  As 
discussed  in  the  proof  of  Theorem  3.9.5,  each  circuit  can  be  organized  so  that  all  steps  on 
which  a  given  step  depends  precede  it.  We  assume  that  a  string  in  CIRCUIT  SAT  meets 
this  condition.  Design  an  NTM  which  on  such  a  string  uses  choice  inputs  to  assign  values 
to  each  of  the  variables  in  the  string.  Then  invoke  the  program  described  in  the  proof  of 
Theorem 3.9.5 to evaluate the circuit. For some assignment to the variables $x_1, x_2, \ldots, x_n$,
this  nondeterministic  program  can  accept  each  string  in  CIRCUIT  SAT  but  no  string  not  in 
CIRCUIT  SAT.  It  follows  that  CIRCUIT  SAT  is  in  NP.  ■ 
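
The nondeterministic program in this proof guesses values for the variables and then evaluates the circuit. A minimal Python sketch, under the assumption that steps are encoded as tuples (`("READ", i)` reads variable x_i; this encoding is hypothetical), separates the two parts: `verify` is the polynomial-time check on one set of choice inputs, and `circuit_sat` replaces the nondeterministic guess by a deterministic search over all 2^n assignments, which takes exponential time.

```python
from itertools import product


def evaluate(steps, output, assignment):
    """Evaluate the circuit in one pass, reading variable i from assignment[i]."""
    values = []
    for op, *args in steps:
        if op == "READ":
            values.append(assignment[args[0]])
        elif op == "NOT":
            values.append(1 - values[args[0]])
        elif op == "AND":
            values.append(values[args[0]] & values[args[1]])
        elif op == "OR":
            values.append(values[args[0]] | values[args[1]])
        else:
            raise ValueError(f"unknown step type: {op}")
    return values[output]


def verify(steps, output, choice):
    """Accept if the guessed assignment `choice` makes the output 1."""
    return evaluate(steps, output, choice) == 1


def circuit_sat(steps, output, n):
    """Deterministic stand-in for the NTM: try every assignment of n variables."""
    return any(verify(steps, output, c) for c in product((0, 1), repeat=n))


contradiction = [("READ", 0), ("NOT", 0), ("AND", 0, 1)]  # x0 AND NOT x0
conjunction = [("READ", 0), ("READ", 1), ("AND", 0, 1)]   # x0 AND x1
```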

The model used to show that a language is P-complete directly parallels the model used to
show that a language $L_1$ is NP-complete. We first show that $L_1$ is in NP and then show that
every language $L \in$ NP can be translated to it in polynomial time. That is, we show that there
is a polynomial $p$ and algorithm that on inputs of length $n$ runs in time $p(n)$, and that for
each string $w$ the algorithm translates $w$ into a string $v$ such that $w$ is in $L$ if and only if $v$ is
in $L_1$. (This is called a polynomial-time reduction.) Since any algorithm that uses log-space
(as does the program in Fig. 3.27) runs in polynomial time (see Theorem 8.5.8), a log-space
reduction can be used in lieu of a polynomial-time reduction.

If we have already shown that a language $L_0$ is NP-complete, we can show that another
language, $L_1$, in NP is NP-complete by translating $L_0$ into $L_1$ with a polynomial-time
algorithm. Since the composition of two polynomial-time algorithms is another polynomial-time
algorithm (see Problem 3.2), every language in NP can be translated in polynomial time into
$L_1$ and $L_1$ is NP-complete. The diagram shown in Fig. 3.28 applies when the reductions
are polynomial-time and the languages are members of NP instead of P. Many NP-complete
languages are exhibited in Section 8.10.

We  apply  this  idea  to  show  that  SATISFIABILITY  is  NP-complete.  Strings  in  this  language 
consist  of  strings  representing  the  POSE  (product-of-sums  expansion)  of  a  Boolean  function. 
Thus,  they  consist  of  clauses  containing  literals  (a  variable  or  its  negation)  with  the  property 
that  for  some  value  of  the  variables  at  least  one  literal  in  each  clause  is  satisfied. 

SATISFIABILITY 

Instance: A set of literals $X = \{x_1, \bar{x}_1, x_2, \bar{x}_2, \ldots, x_n, \bar{x}_n\}$ and a sequence of clauses
$C = (c_1, c_2, \ldots, c_m)$ where each clause $c_i$ is a subset of $X$.

Answer: “Yes” if there is a (satisfying) assignment of values for the variables $\{x_1, x_2, \ldots, x_n\}$
over the set $\mathcal{B}$ such that each clause has at least one literal whose value is 1.

THEOREM 3.9.7 The language SATISFIABILITY is NP-complete.

Proof SATISFIABILITY is in NP because for each string $w$ in this language there is a
satisfying assignment for its variables that can be verified by a polynomial-time program. We
sketch a deterministic RAM program for this purpose. This program reads as many choice
variables as there are variables in $w$ and stores them in memory locations. It then
evaluates each literal in each clause in $w$ and declares this string satisfied if all clauses evaluate
to 1. This program, which runs in time linear in the length of $w$, can be converted to
a Turing-machine program using the construction of Theorem 3.8.1. This program
executes in a time cubic in the time of the original program on the RAM. We now show
that every language in NP can be reduced to SATISFIABILITY via a polynomial-time
program.

Given an instance of CIRCUIT SAT, as we now show, we can convert the circuit
description, a straight-line program (see Section 2.2), into an instance of SATISFIABILITY such that
the former is a “yes” instance of CIRCUIT SAT if and only if the latter is a “yes” instance
of SATISFIABILITY. Shown below are the different steps of a straight-line program and the
clauses used to replace them in constructing an instance of SATISFIABILITY. A deterministic
TM can be designed to make these translations in time proportional to the length of the
circuit description. Clearly the instance of SATISFIABILITY that it produces is a satisfiable
instance if and only if the instance of CIRCUIT SAT is satisfiable.


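Step Type        Corresponding Clauses

(i READ x)       $(\bar{g}_i \lor x)$   $(g_i \lor \bar{x})$
(i NOT j)        $(\bar{g}_i \lor \bar{g}_j)$   $(g_i \lor g_j)$
(i OR j k)       $(g_i \lor \bar{g}_j)$   $(g_i \lor \bar{g}_k)$   $(\bar{g}_i \lor g_j \lor g_k)$
(i AND j k)      $(\bar{g}_i \lor g_j)$   $(\bar{g}_i \lor g_k)$   $(g_i \lor \bar{g}_j \lor \bar{g}_k)$
(i OUTPUT j)     $(g_j)$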

For each gate type it is easy to see that the corresponding clauses are simultaneously satisfied
only for those gate and argument values that are consistent with the type of gate. For
example, a NOT gate with input $g_j$ has value $g_i = 1$ when $g_j$ has value 0 and $g_i = 0$ when
$g_j$ has value 1. In both cases, both of the clauses $(\bar{g}_i \lor \bar{g}_j)$ and $(g_i \lor g_j)$ are satisfied.
However, if $g_i$ is equal to $g_j$, at least one of the clauses is not satisfied. Similarly, for an
AND gate, examining all eight values for the triple $(g_i, g_j, g_k)$ shows
that all three clauses are satisfied only when $g_i$ is the AND of $g_j$ and $g_k$. The verification
of the above statements is left as a problem for the reader. (See Problem 3.36.) Since the
output clause $(g_j)$ is true if and only if the circuit output has value 1, it follows that the
clauses can all be satisfied simultaneously if and only if some assignment to the input variables
gives the circuit output the value 1; that is, the circuit is
satisfiable.
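
The case analysis above can be checked mechanically. The following Python sketch assumes the gate clauses are $(\bar{g}_i \lor \bar{g}_j)(g_i \lor g_j)$ for NOT, $(g_i \lor \bar{g}_j)(g_i \lor \bar{g}_k)(\bar{g}_i \lor g_j \lor g_k)$ for OR, and $(\bar{g}_i \lor g_j)(\bar{g}_i \lor g_k)(g_i \lor \bar{g}_j \lor \bar{g}_k)$ for AND. A literal is encoded as a pair (variable, value that makes it true); this encoding is an assumption of the sketch.

```python
from itertools import product

# A clause is a list of literals; ("i", 0) is the negated literal for g_i,
# ("i", 1) the unnegated one.
CLAUSES = {
    "NOT": [[("i", 0), ("j", 0)], [("i", 1), ("j", 1)]],
    "OR":  [[("i", 1), ("j", 0)], [("i", 1), ("k", 0)],
            [("i", 0), ("j", 1), ("k", 1)]],
    "AND": [[("i", 0), ("j", 1)], [("i", 0), ("k", 1)],
            [("i", 1), ("j", 0), ("k", 0)]],
}

# The function each gate computes from its arguments (k is ignored by NOT).
GATE = {
    "NOT": lambda j, k: 1 - j,
    "OR":  lambda j, k: j | k,
    "AND": lambda j, k: j & k,
}


def all_satisfied(clauses, val):
    """True if every clause contains at least one literal made true by val."""
    return all(any(val[v] == want for v, want in clause) for clause in clauses)


def clauses_match_gate(kind):
    """Check all eight triples: clauses hold exactly when g_i is the gate output."""
    for i, j, k in product((0, 1), repeat=3):
        val = {"i": i, "j": j, "k": k}
        if all_satisfied(CLAUSES[kind], val) != (i == GATE[kind](j, k)):
            return False
    return True


results = {kind: clauses_match_gate(kind) for kind in CLAUSES}
```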

Given an instance of CIRCUIT SAT, clearly a deterministic TM can produce the clauses
corresponding to each gate using a temporary storage space that is logarithmic in the length
of the circuit description because it need deal only with integers that are linear in the length
of the input. Since a log-space program runs in polynomial time, each instance of CIRCUIT SAT
can be translated into an instance of SATISFIABILITY in a number of steps polynomial in the
length of the instance of CIRCUIT SAT. Since SATISFIABILITY is also in NP, it is NP-complete. ■

3.9.7 An Efficient Circuit Simulation of TM Computations*
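
The translation from CIRCUIT SAT to SATISFIABILITY can be sketched in Python as follows. The tuple encoding of steps, the variable names `("g", i)` and `("x", i)`, and the brute-force `satisfiable` helper are all assumptions made for illustration; the clauses follow the step-by-step table with negations read as in the NOT and AND examples of the proof. The brute-force search is exponential and is included only to check small instances.

```python
from itertools import product


def circuit_to_clauses(steps):
    """Translate a straight-line program into clauses, one gate at a time.

    A literal is (variable, value that makes it true); step i defines ("g", i).
    """
    clauses = []
    for i, (op, *args) in enumerate(steps):
        g = ("g", i)
        if op == "READ":
            x = ("x", args[0])
            clauses += [[(g, 0), (x, 1)], [(g, 1), (x, 0)]]
        elif op == "NOT":
            j = ("g", args[0])
            clauses += [[(g, 0), (j, 0)], [(g, 1), (j, 1)]]
        elif op == "OR":
            j, k = ("g", args[0]), ("g", args[1])
            clauses += [[(g, 1), (j, 0)], [(g, 1), (k, 0)],
                        [(g, 0), (j, 1), (k, 1)]]
        elif op == "AND":
            j, k = ("g", args[0]), ("g", args[1])
            clauses += [[(g, 0), (j, 1)], [(g, 0), (k, 1)],
                        [(g, 1), (j, 0), (k, 0)]]
        elif op == "OUTPUT":
            clauses.append([(("g", args[0]), 1)])
        else:
            raise ValueError(f"unknown step type: {op}")
    return clauses


def satisfiable(clauses):
    """Exhaustive search over all assignments (exponential; for small tests only)."""
    variables = sorted({v for clause in clauses for v, _ in clause})
    for bits in product((0, 1), repeat=len(variables)):
        val = dict(zip(variables, bits))
        if all(any(val[v] == want for v, want in c) for c in clauses):
            return True
    return False


unsat_circuit = [("READ", 0), ("NOT", 0), ("AND", 0, 1), ("OUTPUT", 2)]
sat_circuit = [("READ", 0), ("READ", 1), ("OR", 0, 1), ("OUTPUT", 2)]
```

The translation itself makes one pass over the steps and emits a constant number of clauses per step, so its output has length linear in the number of steps.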

In this section we construct a much more efficient circuit of size $O(Tb \log m)$ that simulates
a computation done in $T$ steps by an $m$-word, $b$-bit one-tape TM. A similar result on circuit
depth is shown.

THEOREM 3.9.8 Let an $m$-word, $b$-bit Turing machine compute in $T$ steps the function $f$, a
projection of $f_{\mathrm{TM}}^{(T,m)}$, the function computed by the TM in $T$ steps. Then the following bounds on
the size and depth of $f$ over the complete basis $\Omega$ must be satisfied:

$$C_\Omega(f) = O\bigl(T \log[\min(bT, S)]\bigr)$$

$$D_\Omega(f) = O(T)$$

Proof The circuit $C_{m,T}$ described in Theorem 3.9.1 has size $O(ST)$, where
$S = mb$. We now show that a circuit computing the same function, $N(1, T, m)$, can be
constructed whose size is $O\bigl(T \log[\min(bT, S)]\bigr)$. This new circuit is obtained by more
efficiently simulating the tape unit portion of a Turing machine. We observe that if the head
never reaches a cell, the cell circuit of Fig. 3.26 can be replaced by wires that pass its inputs
to its output. It follows that the number of gates can be reduced if we keep the head near
the center of a simulated tape by “centering” it periodically. This is the basis for the circuit
constructed here.

It simplifies the design of $N(1, T, m)$ to assume that the tape unit has cells indexed
from $-m$ to $m$. Since the head is initially placed over the cell indexed with 0, it is over
the middle cell of the tape unit. (The control unit is designed so that the head never enters
cells whose index is negative.) We construct $N(1, T, m)$ from a subcircuit $N(c, s, n)$ that
simulates $s$ steps of a tape unit containing $n$ $b$-bit cells under the assumption that the tape
head is initially over one of the middle $c$ cells, where $c$ and $n$ are odd. Here $n \ge c + 2s$, so
that in $s$ steps the head cannot move from one of the middle $c$ cells to positions that are not
simulated by this circuit. Let $C(c, s, n)$ and $D(c, s, n)$ be the size and depth of $N(c, s, n)$.

As base cases for our recursive construction of $N(c, s, n)$, consider the circuits $N(1, 1, 3)$
and $N(3, 1, 5)$. They can be constructed from copies of the tape circuit $C_t(3)$ and $C_t(5)$
since they simulate one step of tape units containing three and five cells, respectively. In fact,
these circuits can be simplified by removing unused gates. Without simplification $C_t(n)$
contains $5(b + 1)$ gates in each of the $n$ cell circuits (see Fig. 3.26) as well as $(n - 1)b$ gates
in the vector OR circuit, for a total of at most $6n(b + 1)$ gates. It has depth $4 + \lceil \log_2 n \rceil$.
Thus, $N(1, 1, 3)$ and $N(3, 1, 5)$ each can be realized with $O(b)$ gates and depth $O(1)$.

We now give a recursive construction of a circuit that simulates a tape unit. The
$N(1, 2q, 4q + 1)$ circuit simulates $2q$ steps of the tape unit when the head is over the middle
cell. It can be decomposed into an $N(1, q, 2q + 1)$ circuit simulating the first $q$ steps and
an $N(2q + 1, q, 4q + 1)$ circuit simulating the second $q$ steps, as shown in Fig. 3.29. In
the $N(1, q, 2q + 1)$ circuit, the head may move from the middle position to any one of
$2q + 1$ positions in $q$ steps, which requires that $2q + 1$ of the inputs be supplied to it. In the
$N(2q + 1, q, 4q + 1)$ circuit, the head starts in the middle $2q + 1$ positions and may move
to any one of $4q + 1$ middle positions in the next $q$ steps, which requires that $4q + 1$ inputs
be supplied to it. The size and depth of our $N(1, 2q, 4q + 1)$ circuit satisfy the following
recurrences:

$$\begin{aligned}
C(1, 2q, 4q + 1) &\le C(1, q, 2q + 1) + C(2q + 1, q, 4q + 1) \\
D(1, 2q, 4q + 1) &\le D(1, q, 2q + 1) + D(2q + 1, q, 4q + 1)
\end{aligned} \tag{3.4}$$

Figure 3.29 A decomposition of an $N(1, 2q, 4q + 1)$ circuit.


When the number of tape cells is bounded, the above construction and recurrences
can be modified. Let $m = 2^p$ be the maximum number of cells used during a $T$-step
computation by the TM. We simulate this computation by placing the head over the middle
of a tape with $2m + 1$ cells. It follows that at least $m$ steps are needed to reach each of the
reachable cells. Thus, if $T \le m$, we can simulate the computation with an $N(1, T, 2T + 1)$
circuit. If $T > m$, we can simulate the first $m$ steps with an $N(1, m, 2m + 1)$ circuit and
the remaining $T - m$ steps with $\lceil (T - m)/m \rceil$ copies of an $N(2m + 1, m, 4m + 1)$ circuit.
This follows because at the end of the first $m$ steps the head is over the middle $2m + 1$ of
$4m + 1$ cells (of which only $2m + 1$ are used) and remains in this region after $m$ steps due
to the limitation on the number of cells used by the TM.

From the above discussion we have the following bounds on the size $C(T, m)$ and depth
$D(T, m)$ of a simulating circuit for a $T$-step, $m$-word TM computation:

$$\begin{aligned}
C(T, m) &\le \begin{cases}
C(1, T, 2T + 1) & T \le m \\
C(1, m, 2m + 1) + \left(\lceil T/m \rceil - 1\right) C(2m + 1, m, 4m + 1) & T > m
\end{cases} \\
D(T, m) &\le \begin{cases}
D(1, T, 2T + 1) & T \le m \\
D(1, m, 2m + 1) + \left(\lceil T/m \rceil - 1\right) D(2m + 1, m, 4m + 1) & T > m
\end{cases}
\end{aligned} \tag{3.5}$$

We complete the proof of Theorem 3.9.8 by bounding $C(1, 2q, 4q + 1)$, $C(2q + 1, q, 4q + 1)$,
$D(1, 2q, 4q + 1)$, and $D(2q + 1, q, 4q + 1)$ appearing in (3.4) and
combining them with the bounds of (3.5).

We now give a recursive construction of an $N(2q + 1, q, 4q + 1)$ circuit from which
these bounds are derived. Shown in Fig. 3.30 is the recursive decomposition of an
$N(4t + 1, 2t, 8t + 1)$ circuit in terms of two copies of $N(2t + 1, t, 4t + 1)$ circuits. The $t$-centering
circuits detect whether the head is in positions $2t, 2t - 1, \ldots, 1, 0$ or in positions $-1, \ldots, -2t$.
In the former case, this circuit cyclically shifts the $8t + 1$ inputs down by $t$ positions;
in the latter, it cyclically shifts them up by $t$ positions. The result is that the head is centered
in the middle $2t + 1$ positions. The OR of $s_{-1}, \ldots, s_{-2t}$ can be used as a signal to determine
which shift to take. After centering, $t$ steps are simulated, the head is centered again, and
another $t$ steps are again simulated. Two $t$-correction circuits cyclically shift the results in
directions that are the reverse of the first two shifts. This circuit correctly simulates the tape
computation over $2t$ steps and produces an $N(4t + 1, 2t, 8t + 1)$ circuit.

Figure 3.30 A recursive decomposition of $N(4t + 1, 2t, 8t + 1)$.

A $t$-centering circuit can be realized as a single stage of the cyclic shift circuit described
in Section 2.5.2 and shown in Fig. 2.8. A $t$-correction circuit is just a $t$-centering circuit
in which the shift is in the reverse direction. The four shifting circuits can be realized with
$O(tb)$ gates and constant depth. The two OR trees to determine the direction of the shift can
be realized with $O(t)$ gates and depth $O(\log t)$. From this discussion we have the following
bounds on the size and depth of $N(4t + 1, 2t, 8t + 1)$:

$$\begin{aligned}
C(4t + 1, 2t, 8t + 1) &\le 2C(2t + 1, t, 4t + 1) + O(bt) \\
C(3, 1, 5) &\le O(b) \\
D(4t + 1, 2t, 8t + 1) &\le 2D(2t + 1, t, 4t + 1) + 2\lceil \log_2 t \rceil \\
D(3, 1, 5) &\le O(1)
\end{aligned}$$

We now solve this set of recurrences. Let $\mathcal{C}(k) = C(2t + 1, t, 4t + 1)$ and
$\mathcal{D}(k) = D(2t + 1, t, 4t + 1)$ when $t = 2^k$. The above bounds translate into the following recurrences:

$$\begin{aligned}
\mathcal{C}(k + 1) &\le 2\,\mathcal{C}(k) + K_1 2^k + K_2 \\
\mathcal{C}(0) &\le K_3 \\
\mathcal{D}(k + 1) &\le 2\,\mathcal{D}(k) + 2k + K_4 \\
\mathcal{D}(0) &\le K_5
\end{aligned}$$

for constants $K_1$, $K_2$, $K_3$, $K_4$, and $K_5$. It is straightforward to show that $\mathcal{C}(k)$ and
$\mathcal{D}(k)$ satisfy the following inequalities:

$$\begin{aligned}
\mathcal{C}(k) &\le 2^k (K_1 k/2 + K_2 + K_3) - K_2 \\
\mathcal{D}(k) &\le 2^k (K_5 + K_4 + 2) - 2k - (K_4 + 2)
\end{aligned}$$

We now derive explicit upper bounds to (3.4). Let $\Lambda(k) = C(1, q, 2q + 1)$ and
$\Delta(k) = D(1, q, 2q + 1)$ when $q = 2^k$. Then, the inequalities of (3.4) become the following:

$$\begin{aligned}
\Lambda(k + 1) &\le \Lambda(k) + \mathcal{C}(k) \\
\Lambda(0) &\le K_6 \\
\Delta(k + 1) &\le \Delta(k) + \mathcal{D}(k) \\
\Delta(0) &\le K_7
\end{aligned}$$

where $K_6 = C(1, 1, 3) = 7b + 3$ and $K_7 = D(1, 1, 3) = 4$. The solutions to these
recurrences are given below.

$$\begin{aligned}
\Lambda(k) &\le K_6 + \sum_{j=0}^{k-1} \mathcal{C}(j) \\
&\le 2^k (K_1 k/2 + K_2 + K_3 - K_1) - kK_2 + \bigl(K_6 - (K_2 + K_3 - K_1)\bigr) \\
&= O(k 2^k) \\
\Delta(k) &\le K_7 + \sum_{j=0}^{k-1} \mathcal{D}(j) \\
&\le 2^k (K_5 + K_4 + 2) - k^2 + \bigl(1 - (K_4 + 2)\bigr)k + \bigl(K_7 - (K_5 + K_4 + 2)\bigr) \\
&= O(2^k)
\end{aligned}$$

Here we have made use of the identity in Problem 3.1. From (3.5) and (3.6) we establish
the result of Theorem 3.9.8. ■
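
The closed forms in this proof can be checked numerically. The Python sketch below is an illustration, not part of the proof: it takes the four recurrences with equality, iterates them, and compares each value with the corresponding closed-form expression. The names `size_step`, `depth_step`, `size_total`, and `depth_total` are hypothetical labels for the quantities $C(2t+1, t, 4t+1)$, $D(2t+1, t, 4t+1)$, $C(1, q, 2q+1)$, and $D(1, q, 2q+1)$ with $t = q = 2^k$, and the constants passed in are arbitrary test values.

```python
from fractions import Fraction


def closed_forms_hold(K1, K2, K3, K4, K5, K6, K7, kmax=12):
    """Iterate the recurrences with equality and compare with the closed forms."""
    size_step, depth_step = K3, K5      # values at k = 0
    size_total, depth_total = K6, K7    # values at k = 0
    for k in range(kmax + 1):
        half = Fraction(K1 * k, 2)      # exact K1*k/2, also for odd K1*k
        expected = (
            2**k * (half + K2 + K3) - K2,
            2**k * (K5 + K4 + 2) - 2 * k - (K4 + 2),
            2**k * (half + K2 + K3 - K1) - k * K2 + (K6 - (K2 + K3 - K1)),
            2**k * (K5 + K4 + 2) - k * k + (1 - (K4 + 2)) * k
            + (K7 - (K5 + K4 + 2)),
        )
        if (size_step, depth_step, size_total, depth_total) != expected:
            return False
        # Advance the totals first: they use the step values at index k.
        size_total += size_step
        depth_total += depth_step
        size_step = 2 * size_step + K1 * 2**k + K2
        depth_step = 2 * depth_step + 2 * k + K4
    return True


ok = closed_forms_hold(K1=3, K2=5, K3=7, K4=2, K5=4, K6=11, K7=4)
```

Because the closed forms match the recurrences exactly when equality holds, they are upper bounds whenever the recurrences hold with inequality.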


3.10  Design  of  a  Simple  CPU 

In this section we design an eleven-instruction CPU for a general-purpose computer that has a
random-access memory with $2^{12}$ 16-bit memory words. We use this design to illustrate how a
general-purpose computer can be assembled from gates and binary storage devices (flip-flops).
The design is purposely kept simple so that basic concepts are made explicit. In practice,
however, CPU design can be very complex. Since the CPU is the heart of every computer, a
high premium is attached to making it fast. Many clever ideas have been developed for this
purpose, almost all of which we must for simplicity ignore here.

Before beginning, we note that a typical complex instruction set (CISC) CPU, one with
a rich set of instructions, contains several tens of thousands of gates, while as shown in the
previous section, a random-access memory unit has a number of equivalent gates proportional
to its memory capacity in bits. (CPUs are often sold with caches, small random-access memory
units that add materially to the number of equivalent gates.) The CPUs of reduced instruction
set (RISC) computers have many fewer gates. By contrast, a four-megabyte memory has the
equivalent of several tens of millions of gates. As a consequence, the size and depth of the
next-state and output functions of the random-access memory, $\delta_{\mathrm{RMEM}}$ and $\lambda_{\mathrm{RMEM}}$, typically
dominate the size and depth of the next-state and output functions, $\delta_{\mathrm{CPU}}$ and $\lambda_{\mathrm{CPU}}$, of the
CPU, as shown in Theorem 3.6.1.

3.10.1 The Register Set

A  CPU  is  a  sequential  circuit  that  repeatedly  reads  and  executes  an  instruction  from  its  memory 
in  what  is  known  as  the  fetch-and-execute  cycle.  (See  Sections  3.4  and  3.10.2.)  A  machine- 
language  program  is  a  set  of  instructions  drawn  from  the  instruction  set  of  the  CPU.  In  our 
simple  CPU  each  instruction  consists  of  two  parts,  an  opcode  and  an  address,  as  shown 
schematically  below. 


 1        4 5                        16
+----------+---------------------------+
|  Opcode  |          Address          |
+----------+---------------------------+


Since  our  computer  has  eleven  instructions,  we  use  a  4-bit  opcode,  a  length  sufficient  to 
represent  all  of  them.  Twelve  bits  remain  in  the  16-bit  word,  providing  addresses  for  4,096 
16-bit  words  in  a  random-access  memory. 
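
The instruction format can be made concrete with a short Python sketch. It assumes the opcode occupies the four high-order bits of the 16-bit word and the address the twelve low-order bits; the function names `encode` and `decode` are hypothetical.

```python
OPCODE_BITS = 4
ADDRESS_BITS = 12


def encode(opcode: int, address: int) -> int:
    """Pack a 4-bit opcode and a 12-bit address into one 16-bit word."""
    if not 0 <= opcode < 2**OPCODE_BITS:
        raise ValueError("opcode must fit in 4 bits")
    if not 0 <= address < 2**ADDRESS_BITS:
        raise ValueError("address must fit in 12 bits")
    return (opcode << ADDRESS_BITS) | address


def decode(word: int) -> tuple[int, int]:
    """Split a 16-bit word into its (opcode, address) parts."""
    return word >> ADDRESS_BITS, word & (2**ADDRESS_BITS - 1)


word = encode(0b0101, 100)   # opcode 0101 with address 100
fields = decode(word)
```

Twelve address bits give $2^{12} = 4{,}096$ distinct addresses, which matches the memory size stated above.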

We  let  our  CPU  have  eight  special  registers:  the  16-bit  accumulator  (AC),  the  12-bit 
program  counter  (PC),  the  4-bit  opcode  register  (OPC),  the  12-bit  memory  address  register 
(MAR),  the  16-bit  memory  data  register  (MDR),  the  16-bit  input  register  (INR),  the  16- 
bit  output  register  (denoted  OUTR),  and  the  halt  register  (HLT).  These  registers  are  shown 
schematically  together  with  the  random-access  memory  in  Fig.  3.31. 

The  program  counter  PC  contains  the  address  from  which  the  next  instruction  will  be 
fetched.  Normally  this  is  the  address  following  the  address  of  the  current  instruction.  However, 
if  some  condition  is  true,  such  as  that  the  contents  of  the  accumulator  AC  are  zero,  the  program 
might  place  a  new  address  in  the  PC  and  jump  to  this  new  address.  The  memory  address 
register  MAR  contains  the  address  used  by  the  random-access  memory  to  fetch  a  word.  The 
memory  data  register  MDR  contains  the  word  fetched  from  the  memory.  The  halt  register 
HLT contains the value 0 if the CPU is halted and otherwise contains 1.


Figure 3.31 Basic registers of the simple CPU and the paths connecting them. Also shown
is the arithmetic logic unit (ALU) containing circuits for AND, addition, shifting, and Boolean
complement.

3.10.2 The Fetch-and-Execute Cycle

The  fetch-and-execute  cycle  has  a  fetch  portion  and  an  execution  portion.  The  fetch  portion 
is  always  the  same:  the  instruction  whose  address  is  in  the  PC  is  fetched  into  the  MDR  and 
the  opcode  portion  of  this  register  is  copied  into  the  OPC.  At  this  point  the  action  of  the  CPU 
diverges,  based  on  the  instruction  denoted  by  the  value  of  the  OPC.  Suppose,  for  example, 
that  the  OPC  denotes  a  load  accumulator  instruction.  The  action  required  is  to  copy  the  word 
specified by the address part of the instruction into the accumulator. Fig. 3.32 contains a
decomposition of the load accumulator instruction into eight microinstructions executed in six
microcycles.  During  each  microcycle  several  microinstructions  can  be  executed  concurrently, 
as  shown  in  the  table  for  the  second  and  fourth  microcycles.  In  Section  3.10.5  we  describe 
implementations  of  the  fetch-and-execute  cycle  for  each  of  the  instructions  of  our  computer. 
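
The six microcycles of the load accumulator instruction can be traced with a small Python sketch. It is an illustration under stated assumptions: registers are modeled as dictionary entries, memory as a list of 4,096 words, the opcode is taken to be the four high-order bits of a word, and 0101 is the binary opcode of the load accumulator instruction. Microinstructions listed under the same cycle are the ones that could execute concurrently.

```python
LDA_OPCODE = 0b0101


def load_accumulator(reg, memory):
    """Execute the fetch-and-execute cycle for one load accumulator instruction."""
    # Cycle 1: copy contents of PC to MAR.
    reg["MAR"] = reg["PC"]
    # Cycle 2: fetch word at address MAR into MDR; increment PC.
    reg["MDR"] = memory[reg["MAR"]]
    reg["PC"] = (reg["PC"] + 1) % 4096
    # Cycle 3: copy opcode part of MDR to OPC.
    reg["OPC"] = reg["MDR"] >> 12
    # Cycle 4: interpret OPC; copy address part of MDR to MAR.
    if reg["OPC"] != LDA_OPCODE:
        raise ValueError("this sketch handles only the load accumulator instruction")
    reg["MAR"] = reg["MDR"] & 0x0FFF
    # Cycle 5: fetch word at address MAR into MDR.
    reg["MDR"] = memory[reg["MAR"]]
    # Cycle 6: copy MDR into AC.
    reg["AC"] = reg["MDR"]


memory = [0] * 4096
memory[0] = (LDA_OPCODE << 12) | 7   # instruction: load the word at address 7
memory[7] = 1234                     # the operand
reg = {"AC": 0, "PC": 0, "OPC": 0, "MAR": 0, "MDR": 0}
load_accumulator(reg, memory)
```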

It is important to note that a realistic CPU must do more than fetch and execute
instructions: it must be interruptible by a user or an external device that demands its attention. After
fetching  and  executing  an  instruction,  a  CPU  typically  examines  a  small  set  of  flip-flops  to  see 
if  it  must  break  away  from  the  program  it  is  currently  executing  to  handle  an  interrupt,  an 
action  equivalent  to  fetching  an  instruction  associated  with  the  interrupt.  This  action  causes 
an  interrupt  routine  to  be  run  that  responds  to  the  problem  associated  with  the  interrupt,  after 
which  the  CPU  returns  to  the  program  it  was  executing  when  it  was  interrupted.  It  can  do 
this  by  saving  the  address  of  the  next  instruction  of  this  program  (the  value  of  the  PC)  at  a 
special  location  in  memory  (such  as  address  0).  After  handling  the  interrupt,  it  branches  to 
this  address  by  reloading  PC  with  the  old  value. 

3.10.3  The  Instruction  Set 

Figure 3.33 lists the eleven instructions of our simple CPU. The first group consists of
arithmetic (see Section 2.7), logic, and shift instructions (see Section 2.5.1). The circulate
instruction executes a cyclic shift of the accumulator by one place. The second group consists
of instructions to move data between the accumulator and memory. The third set contains
a conditional jump instruction: when the accumulator is zero, it causes the CPU to resume
fetching instructions at a new address, the address in the memory data register. This address
is moved to the program counter before fetching the next instruction. The fourth set contains
input/output instructions. The fifth set contains the halt instruction. Many more
instructions could be added, including ones to simplify the execution of subroutines, handle loops,
and process interrupts. Each instruction has a mnemonic opcode, such as CLA, and a binary
opcode, such as 0010.


Cycle  Microinstruction                         Microinstruction

1      Copy contents of PC to MAR.
2      Fetch word at address MAR into MDR.      Increment PC.
3      Copy opcode part of MDR to OPC.
4      Interpret OPC.                           Copy address part of MDR to MAR.
5      Fetch word at address MAR into MDR.
6      Copy MDR into AC.

Figure 3.32 Decomposition of the load accumulator instruction into eight microinstructions
in six microcycles.


                    Opcode  Binary  Description

Arithmetic, Logic   ADD     0000    Add memory word to AC
                    AND     0001    AND memory word to AC
                    CLA     0010    Clear (set to zero) the accumulator
                    CMA     0011    Complement AC
                    CIL     0100    Circulate AC left
Memory              LDA     0101    Load memory word into AC
                    STA     0110    Store AC into memory word
Jump                JZ      0111    Jump to address if AC zero
I/O                 IN      1000    Load INR into AC
                    OUT     1001    Store AC into OUTR
Halt                HLT     1010    Halt computer

Figure 3.33 Instructions of the simple CPU.

Many other operations can be performed using this set, including subtraction, which
can be realized through the use of ADD, CMA, and two’s-complement arithmetic (see
Problem 3.18). Multiplication is also possible through the use of CIL and ADD (see Problem 3.38).
Since multiple CILs can be used to rotate right one place (fifteen of them for a 16-bit
accumulator), division is also possible. Finally, as
observed in Problem 3.39, every two-input Boolean function can be realized through the use
of AND and CMA. This implies that every Boolean function can be realized by this machine
if it is designed to address enough memory locations.
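
As an illustration of these remarks, here is a minimal Python interpreter for the eleven instructions, followed by a program that subtracts using CMA and ADD. It is a sketch under stated assumptions, not the design developed in this section: it models only the AC and PC, treats the input register as a Python iterator and the output register as a list, wraps 16-bit arithmetic with a mask, and uses the hypothetical helper `word` to assemble instructions.

```python
OPCODES = {"ADD": 0, "AND": 1, "CLA": 2, "CMA": 3, "CIL": 4, "LDA": 5,
           "STA": 6, "JZ": 7, "IN": 8, "OUT": 9, "HLT": 10}
MASK = 0xFFFF  # 16-bit words


def word(mnemonic, address=0):
    """Assemble one instruction: 4-bit opcode followed by a 12-bit address."""
    return (OPCODES[mnemonic] << 12) | address


def run(program, inputs=(), max_steps=10_000):
    """Run until HLT; return (accumulator, list of outputs, final memory)."""
    mem = list(program) + [0] * (4096 - len(program))
    ac, pc, out, inp = 0, 0, [], iter(inputs)
    for _ in range(max_steps):
        op, addr = mem[pc] >> 12, mem[pc] & 0x0FFF
        pc = (pc + 1) % 4096
        if op == 0:
            ac = (ac + mem[addr]) & MASK             # ADD
        elif op == 1:
            ac &= mem[addr]                          # AND
        elif op == 2:
            ac = 0                                   # CLA
        elif op == 3:
            ac ^= MASK                               # CMA
        elif op == 4:
            ac = ((ac << 1) | (ac >> 15)) & MASK     # CIL
        elif op == 5:
            ac = mem[addr]                           # LDA
        elif op == 6:
            mem[addr] = ac                           # STA
        elif op == 7:
            pc = addr if ac == 0 else pc             # JZ
        elif op == 8:
            ac = next(inp) & MASK                    # IN
        elif op == 9:
            out.append(ac)                           # OUT
        elif op == 10:
            return ac, out, mem                      # HLT
        else:
            raise ValueError(f"undefined opcode {op}")
    raise RuntimeError("step limit reached without HLT")


# Compute a - b as a + (complement of b) + 1 in two's-complement arithmetic.
program = [word("LDA", 8), word("CMA"), word("ADD", 9), word("ADD", 7),
           word("STA", 10), word("OUT"), word("HLT"),
           25,   # address 7: a
           7,    # address 8: b
           1,    # address 9: the constant 1
           0]    # address 10: result
ac, out, mem = run(program)
```

The program leaves 25 - 7 = 18 in the accumulator, in the output list, and at address 10.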

Each of these instructions is a direct memory instruction, by which we mean that all
addresses refer directly to memory locations containing the operands (data) on which the
program operates. Most CPUs also have indirect memory instructions (and are said to support
indirection). These are instructions in which an address is interpreted as the address at which
to find the address containing the needed operand. To find such an indirect operand, the CPU
does two memory fetches, the first to find the address of the operand and the second to find
the operand itself. Often a single bit is added to an opcode to denote that an instruction is an
indirect memory instruction.

An  instruction  stored  in  the  memory  of  our  computer  consists  of  sixteen  binary  digits,  the 
first  four  denoting  the  opcode  and  the  last  twelve  denoting  an  address.  Because  it  is  hard  for 
humans  to  interpret  such  machine-language  statements,  mnemonic  opcodes  and  assembly 
languages  have  been  devised. 

3.10.4 Assembly-Language Programming

An assembly-language program consists of a number of lines each containing either a real or
pseudo-instruction. Real instructions correspond exactly to machine-language instructions
except that they contain mnemonics and symbolic addresses instead of binary sequences.
Pseudo-instructions are directions to the assembler, the program that translates an assembly-language
program into machine language. A typical pseudo-instruction is ORG 100, which instructs
the assembler to place the following lines at locations beginning with location 100. Another
example is the DAT pseudo-instruction that identifies a word containing only data. The END
pseudo-instruction identifies the end of the assembly-language program.

Each  assembly-language  instruction  fits  on  one  line.  A  typical  instruction  has  the  following 
fields,  some  or  all  of  which  may  be  used. 


Symbolic Address    Mnemonic    Address    Indirect Bit    Comment

If an instruction has a Symbolic Address (a string of symbols), the assembler converts it
to the physical address of the instruction and substitutes that physical address for all uses of
the symbolic address. The Address field can contain one or more symbolic or real addresses,
although the assembly language used here allows only one address. The Indirect Bit specifies
whether or not indirection is to be used on the address in question. In our CPU we do not
allow indirection, although we do allow it in our assembly language because it simplifies our
sample program.

Let’s  now  construct  an  assembly-language  program  whose  purpose  is  to  boot  up  a  computer 
that  has  been  reset.  The  boot  program  reads  another  program  provided  through  its  input  port 
and  stores  this  new  program  (a  sequence  of  16-bit  words)  in  the  memory  locations  just  above 
itself.  When  it  has  finished  reading  this  new  program  (determined  by  reading  a  zero  word), 
it  transfers  control  to  the  new  program  by  jumping  to  the  first  location  above  itself.  When 
computers  are  turned  off  at  night  they  need  to  be  rebooted,  typically  by  executing  a  program 
of  this  kind. 

Figure 3.34 shows a program to boot up our computer. It uses three symbolic addresses,
ADDR_1, ADDR_2, ADDR_3, and one real address, 10. We assume this program resides
permanently in locations 0 through 9 of the memory. After being reset, the CPU reads and
executes the instruction at location 0 of its memory.

Symbolic Address  Mnemonic  Address  Indirect Bit  Comment
                  ORG       0                      Program is stored at location 0.
ADDR_1            IN                               Start of program.
                  JZ        10                     Transfer control if AC zero.
                  STA       ADDR_2   I             Indirect store of input.
                  LDA       ADDR_2                 Start incrementing ADDR_2.
                  ADD       ADDR_3                 Finish incrementing of ADDR_2.
                  STA       ADDR_2                 Store new value of ADDR_2.
                  CLA                              Clear AC.
                  JZ        ADDR_1                 Jump to start of program.
ADDR_2            DAT       10                     Address for indirection.
ADDR_3            DAT       1                      Value for incrementing.
                  END

Figure 3.34 A program to reboot a computer.

The  first  instruction  of  this  program  after  the  ORG  statement  reads  the  value  in  the  input 
register  into  the  accumulator.  The  second  instruction  jumps  to  location  10  if  the  accumulator 
is  zero,  indicating  that  the  last  word  of  the  second  program  has  been  written  into  the  memory. 
If  this  happens,  the  next  instruction  executed  by  the  CPU  is  at  location  10;  that  is,  control  is 
transferred  to  the  second  program.  If  the  accumulator  is  not  zero,  its  value  is  stored  indirectly  at 
location  ADDR_2.  (We  explain  the  indirect  STA  in  the  next  paragraph.)  On  the  first  execution 
of  this  command,  the  value  of  ADDR_2  is  10,  so  that  the  contents  of  the  accumulator  are 
stored at location 10. The next three steps increment the value of ADDR_2 by placing its
contents in the accumulator, adding to it the value in location ADDR_3, namely 1, and storing
the new value into location ADDR_2. Finally, the accumulator is zeroed and a JZ instruction is
used to return to location ADDR_1, the first address of the boot program.
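The behavior of the boot program can be checked with a small simulation. The following Python sketch is only an illustration under stated assumptions: the function name run_boot, the tuple encoding of instructions, and the list standing in for the input port are all hypothetical, and the simulated machine is assumed to support the indirect STA even though our CPU does not.

```python
def run_boot(input_words, memory_size=32, max_steps=10_000):
    """Simulate the boot program of Fig. 3.34 on a toy accumulator machine.

    Assumptions (not from the text): instructions are stored as tuples
    (mnemonic, address, indirect) rather than 16-bit words, and the input
    port is modeled by the iterable input_words, which must end with 0.
    """
    ADDR_1, ADDR_2, ADDR_3 = 0, 8, 9  # addresses the assembler would assign
    memory = [0] * memory_size
    memory[0] = ("IN", None, False)     # read input register into AC
    memory[1] = ("JZ", 10, False)       # leave the loader when AC is zero
    memory[2] = ("STA", ADDR_2, True)   # indirect store of input
    memory[3] = ("LDA", ADDR_2, False)  # start incrementing ADDR_2
    memory[4] = ("ADD", ADDR_3, False)  # finish incrementing
    memory[5] = ("STA", ADDR_2, False)  # store new value of ADDR_2
    memory[6] = ("CLA", None, False)    # clear AC
    memory[7] = ("JZ", ADDR_1, False)   # AC is zero, so this always jumps
    memory[8] = 10                      # ADDR_2: address for indirection
    memory[9] = 1                       # ADDR_3: value for incrementing

    inputs = iter(input_words)
    ac, pc = 0, 0
    for _ in range(max_steps):
        if pc == 10:                    # control has reached the loaded program
            return memory
        op, addr, indirect = memory[pc]
        pc += 1
        if op == "IN":
            ac = next(inputs)
        elif op == "JZ":
            if ac == 0:
                pc = addr
        elif op == "STA":
            # With indirection, the word at addr holds the real destination.
            target = memory[addr] if indirect else addr
            memory[target] = ac
        elif op == "LDA":
            ac = memory[addr]
        elif op == "ADD":
            ac = (ac + memory[addr]) & 0xFFFF  # assumed 16-bit wraparound
        elif op == "CLA":
            ac = 0
    raise RuntimeError("boot program did not finish")


if __name__ == "__main__":
    loaded = run_boot([7, 42, 5, 0])
    print(loaded[8:14])
```

In this sketch the three nonzero input words land in locations 10, 11, and 12, and the word at ADDR_2 ends up holding 13, the next free location.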

The indirect STA instruction in this program is not available in our computer. However,
as shown in Problem 3.42, this instruction can be simulated by a self-modifying subprogram.
While it is considered bad programming practice to write self-modifying programs, this
exercise illustrates the power of self-modification as well as the advantage of having indirection
in the instruction set of a computer.

3.10.5  Timing  and  Control 

Now that the principles of a CPU have been described and a programming example given, we
complete the description of a sequential circuit realizing the CPU. To do this we need to
describe circuits controlling the combining and movement of data. To this end we introduce the
assignment notation in Fig. 3.35. Here the expression AC ← MDR means that the contents
of MDR are copied into AC, whereas AC ← AC + MDR means that the contents of AC and
MDR are added and the result assigned to AC. In all cases the left arrow, ←, signifies that
the result or contents on the right are assigned to the register on the left. However, when the
register on the left contains information of a particular type, such as an address in the case of
PC or an opcode in the case of OPC, and the register on the right contains more information,
the assignment notation means that the relevant bits of the register on the right are loaded
into the register on the left. For example, the assignment PC ← MDR means that the address
portion of MDR is copied to PC.
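The rule that only the relevant bits are transferred can be made concrete with bit masks. The Python sketch below is a hedged illustration: it assumes the 16-bit word described earlier, with the opcode in the four most significant bits and the address in the twelve least significant bits, and it assumes that sums wrap around modulo 2^16. The helper name assign and the dictionary of registers are hypothetical.

```python
WORD_BITS = 16
OPCODE_BITS = 4
ADDRESS_BITS = WORD_BITS - OPCODE_BITS
WORD_MASK = (1 << WORD_BITS) - 1
ADDRESS_MASK = (1 << ADDRESS_BITS) - 1
OPCODE_MASK = (1 << OPCODE_BITS) - 1


def assign(regs, dest, value):
    """Return a copy of regs after the assignment dest <- value.

    PC keeps only the address portion of the value and OPC keeps only the
    opcode portion; every other register keeps a full 16-bit word.
    """
    new_regs = dict(regs)
    if dest == "PC":
        value &= ADDRESS_MASK
    elif dest == "OPC":
        value = (value >> ADDRESS_BITS) & OPCODE_MASK
    else:
        value &= WORD_MASK  # assumed wraparound on overflow
    new_regs[dest] = value
    return new_regs


if __name__ == "__main__":
    regs = {"AC": 7, "MDR": 0x300A, "PC": 0, "OPC": 0}
    print(assign(regs, "AC", regs["MDR"]))               # AC <- MDR
    print(assign(regs, "AC", regs["AC"] + regs["MDR"]))  # AC <- AC + MDR
    print(assign(regs, "PC", regs["MDR"]))               # PC <- MDR
    print(assign(regs, "OPC", regs["MDR"]))              # OPC <- MDR
```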

Notation            Explanation
AC ← MDR            Contents of MDR loaded into AC.
AC ← AC + MDR       Contents of MDR added to AC.
MDR ← M             Contents of memory location MAR loaded into MDR.
M ← MDR             Contents of MDR stored at memory location MAR.
PC ← MDR            Address portion of MDR loaded into PC.
MAR ← PC            Contents of PC loaded into MAR.

Figure 3.35 Microinstructions illustrating assignment notation.

Timing    Microinstructions
t_1       MAR ← PC
t_2       MDR ← M, PC ← PC+1
t_3       OPC ← MDR

Figure 3.36 The microcode for the fetch portion of each instruction.

Register transfer notation uses these assignment operations as well as timing information
to break down a machine-level instruction into microinstructions that are executed in
successive microcycles. The jth microcycle is specified by the timing variable t_j, 1 ≤ j ≤ k. That
is, t_j is 1 during the jth microcycle and is zero otherwise. It is straightforward to show that
these timing variables can be realized by connecting a decoder to the outputs of a counting
circuit, a circuit containing the binary representation of an integer that increments the integer
modulo some other integer on each clock cycle. (See Problem 3.40.)
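The counter-plus-decoder arrangement can be sketched behaviorally. The Python fragment below is an illustration only, not a gate-level design: the names decode and timing_variables are hypothetical, and the counter is assumed to start at zero so that t_1 is the first timing variable to become 1.

```python
def decode(count, k):
    """One-hot decoder: output j (0-based) is 1 exactly when count == j."""
    return tuple(1 if count == j else 0 for j in range(k))


def timing_variables(k, cycles):
    """Yield (t_1, ..., t_k) for each clock cycle of a modulo-k counter."""
    count = 0                      # assumed reset state of the counter
    for _ in range(cycles):
        yield decode(count, k)
        count = (count + 1) % k    # increment modulo k on each clock cycle


if __name__ == "__main__":
    for t in timing_variables(3, 4):
        print(t)
```

Exactly one timing variable is 1 in each cycle, and the pattern repeats every k cycles.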

Since the fetch portion of each instruction is the same, we write a few lines of register
transfer notation for it, as shown in Fig. 3.36. On the left-hand side of each line is a timing
variable indicating the cycle during which the microinstruction is executed.
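The three fetch microcycles can be traced in software. The following Python sketch is a minimal model under assumptions that are not fixed by the text: registers are held in a dictionary, memory is a list of 16-bit words, the opcode occupies the four most significant bits of a word, and PC wraps modulo 2^12. The function names are hypothetical.

```python
ADDRESS_BITS = 12
ADDRESS_MASK = (1 << ADDRESS_BITS) - 1


def fetch_microcycle(j, regs, memory):
    """Apply the microinstructions enabled by timing variable t_j.

    Right-hand sides are read from the old register values, which models
    microinstructions in one microcycle taking effect simultaneously.
    """
    new_regs = dict(regs)
    if j == 1:
        new_regs["MAR"] = regs["PC"]                      # MAR <- PC
    elif j == 2:
        new_regs["MDR"] = memory[regs["MAR"]]             # MDR <- M
        new_regs["PC"] = (regs["PC"] + 1) & ADDRESS_MASK  # PC <- PC+1
    elif j == 3:
        new_regs["OPC"] = (regs["MDR"] >> ADDRESS_BITS) & 0xF  # OPC <- MDR
    else:
        raise ValueError("the fetch portion uses only t_1, t_2, t_3")
    return new_regs


def fetch(regs, memory):
    """Run the three fetch microcycles in order."""
    for j in (1, 2, 3):
        regs = fetch_microcycle(j, regs, memory)
    return regs


if __name__ == "__main__":
    memory = [0] * 16
    memory[5] = 0x300A  # hypothetical word: opcode 3, address 10
    print(fetch({"PC": 5, "MAR": 0, "MDR": 0, "OPC": 0}, memory))
```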

The microinstructions for the execute portion of each instruction of our computer are
shown in Fig. 3.37. On the left-hand side of each line is a timing variable that must be ANDed
with the indicated instruction variable, such as c_ADD, which is 1 if that instruction is in
the opcode register OPC and 0 otherwise. These instruction variables can be generated by a
decoder attached to the output of OPC. Here ¬AC denotes the complement of the accumulator.

Figure 3.37 The execute portions of the microcode of instructions.

Now that we understand how to combine microinstructions in microcycles to produce
macroinstructions, we use this information to define control variables that control the
movement of data between registers or combine the contents of two registers and assign the result
to another register. This information will be used to complete the design of the CPU.

We now introduce notation for control variables. If a microinstruction results in the
movement of data from register B to register A, denoted A ← B in our assignment
notation, we associate the control variable L(A, B) with it. If a microinstruction results in the
combination of the contents of registers B and C with the operation ⊙ and the assignment
of the result to register A, denoted A ← B ⊙ C in our assignment notation, we associate
the control variable L(A, B ⊙ C) with it. For example, inspection of Figs. 3.36 and 3.37
shows that we can write the following expressions for the control variables L(OPC, MDR)
and L(AC, AC+MDR):

L(OPC, MDR) = t_3
L(AC, AC+MDR) = c_ADD ∧ t_6

Thus, OPC is loaded with the contents of MDR when t_3 = 1, and the contents of AC are
added to those of MDR and copied into AC when c_ADD ∧ t_6 = 1.
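These two control variables can be evaluated directly from a timing variable and an instruction variable. The Python sketch below is illustrative only: it assumes six microcycles per instruction and that the ADD microinstruction AC ← AC + MDR occurs in the sixth microcycle, as the expressions above indicate. The function name and the dictionary keys are hypothetical.

```python
def control_variables(microcycle, opcode_mnemonic, k=6):
    """Evaluate L(OPC, MDR) and L(AC, AC+MDR) for one microcycle.

    microcycle is the index j of the timing variable that is currently 1;
    opcode_mnemonic names the instruction held in OPC.
    """
    # One-hot timing variables t[1..k] and the instruction variable c_ADD.
    t = {j: 1 if j == microcycle else 0 for j in range(1, k + 1)}
    c_add = 1 if opcode_mnemonic == "ADD" else 0
    return {
        "L(OPC,MDR)": t[3],             # fetch: OPC <- MDR during t_3
        "L(AC,AC+MDR)": c_add & t[6],   # execute: AC <- AC + MDR for ADD
    }


if __name__ == "__main__":
    for j in range(1, 7):
        print(j, control_variables(j, "ADD"))
```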

The complete set of control variables can be obtained by first grouping together all the
microinstructions that affect a given register, as shown in Fig. 3.38, and then writing expressions
for the control variables. Here M denotes the memory unit and HLT is a special register that
must be set to 1 for the CPU to run. Inspection of Fig. 3.38 leads to the following expressions
for control variables:

L(AC, AC + MDR)     = c_ADD ∧ t_6
L(AC, AC AND MDR)   = c_AND ∧ t_6
L(AC, 0)            = c_CLA ∧ t_4
L(AC, Shift(AC))    = c_CIL ∧ t_4
L(AC, MDR)          = c_LDA ∧ t_6
L(AC, INR)          = c_IN ∧ t_4
L(AC, ¬AC)          = c_CMA ∧ t_4
L(MAR, PC)          = t_1
L(MAR, MDR)         = (c_ADD ∨ c_AND ∨ c_LDA ∨ c_STA) ∧ t_4
L(MDR, M)           = t_2 ∨ (c_ADD ∨ c_AND ∨ c_LDA) ∧ t_5
L(MDR, AC)          = c_STA ∧ t_4
L(M, MDR)           = c_STA ∧ t_5
L(PC, PC+1)         = t_2
L(PC, MDR)          = (AC = 0) ∧ c_JZ ∧ t_4
L(OPC, MDR)         = t_3
L(OUTR, AC)         = c_OUT ∧ t_4
L(t_j)              = c_HLT ∧ t_4   for 1 ≤ j ≤ 6

The expression (AC = 0) denotes a Boolean variable whose value is 1 if all bits in the AC
are zero and 0 otherwise. This variable is the AND of the complement of each component of
register AC.
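The variable (AC = 0) and its use in the jump control can be written out bit by bit. The Python sketch below is a hedged illustration in which AC is represented as a list of bits; the function names ac_is_zero and load_pc_from_mdr are hypothetical.

```python
def ac_is_zero(ac_bits):
    """Return 1 if every bit of AC is 0, computed as the AND of complements."""
    result = 1
    for bit in ac_bits:
        result &= 1 - bit  # AND in the complement of this component
    return result


def load_pc_from_mdr(ac_bits, c_jz, t4):
    """Control variable L(PC, MDR) = (AC = 0) AND c_JZ AND t_4."""
    return ac_is_zero(ac_bits) & c_jz & t4


if __name__ == "__main__":
    print(ac_is_zero([0] * 16))
    print(load_pc_from_mdr([0] * 16, 1, 1))
    print(load_pc_from_mdr([0] * 15 + [1], 1, 1))
```

The jump is taken only when all three conditions hold: the accumulator is zero, the instruction in OPC is JZ, and the fourth microcycle is active.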

To illustrate the remaining steps in the design of the CPU, we show in Fig. 3.39 the
circuits used to provide input to the accumulator AC. Shown are registers AC, MDR, and
INR as well as circuits for the functions f_add (see Section 2.7) and f_and that add two
binary numbers and take their AND, respectively. Also shown are multiplexer circuits f_mux (see
Section 2.5.5). They have three control inputs, L_0, L_1, and L_2, and can select one of eight
inputs to place on their output lines. However, only seven inputs are needed: the result of
adding AC and MDR, the result of ANDing AC and MDR, the zero vector, the result of
shifting AC, the contents of MDR or INR, and the complement of AC. The three control inputs
encode the seven control variables, L(AC, AC + MDR), L(AC, AC AND MDR), L(AC, 0),
L(AC, Shift(AC)), L(AC, MDR), L(AC, INR), and L(AC, ¬AC). Since at most one of these
control variables has value 1 at any one time, the encoder circuit of Section 2.5.3 can be used
to encode these seven control variables into the three bits L_0, L_1, and L_2 shown in Fig. 3.39.
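The encoder and multiplexer arrangement can be sketched behaviorally. The Python fragment below rests on several assumptions that the text does not specify: the unused eighth multiplexer input is taken to feed AC back to itself so that AC is unchanged when no control variable is 1, L_0 is taken to be the least significant select bit, the shift is taken to be a one-bit circular left shift, and words are 16 bits. All function names are hypothetical.

```python
WORD_MASK = 0xFFFF


def encode_select(controls):
    """Encode seven one-hot control variables into the bits (L_0, L_1, L_2).

    Control variable i (0-based) maps to code i + 1; code 0 means that no
    control variable is 1 (assumed here to select "hold AC").
    """
    assert len(controls) == 7 and sum(controls) <= 1
    code = 0
    for i, bit in enumerate(controls):
        if bit:
            code = i + 1
    return code & 1, (code >> 1) & 1, (code >> 2) & 1


def mux8(select, inputs):
    """Eight-input multiplexer addressed by (L_0, L_1, L_2)."""
    l0, l1, l2 = select
    return inputs[l0 | (l1 << 1) | (l2 << 2)]


def next_ac(ac, mdr, inr, controls):
    """Value loaded into AC, given controls in the order listed in the text."""
    rotated = ((ac << 1) | (ac >> 15)) & WORD_MASK  # assumed circular shift
    inputs = [
        ac,                       # code 0: hold (assumption)
        (ac + mdr) & WORD_MASK,   # L(AC, AC + MDR)
        ac & mdr,                 # L(AC, AC AND MDR)
        0,                        # L(AC, 0)
        rotated,                  # L(AC, Shift(AC))
        mdr,                      # L(AC, MDR)
        inr,                      # L(AC, INR)
        ~ac & WORD_MASK,          # L(AC, complement of AC)
    ]
    return mux8(encode_select(controls), inputs)


if __name__ == "__main__":
    print(next_ac(3, 5, 9, [1, 0, 0, 0, 0, 0, 0]))
```

Because the multiplexer and the function circuits operate on each bit position, this arrangement is replicated across the word, which is why the circuit size grows with the word size.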

The  logic  circuit  to  supply  inputs  to  AC  has  size  proportional  to  the  number  of  bits  in  each 
register.  Thus,  if  the  word  size  of  the  CPU  were  scaled  up,  the  size  of  this  circuit  would  scale 
linearly  with  the  word  size. 
Figure 3.39 Circuits providing input to the accumulator AC.


The circuit for the program counter PC can be designed from an adder, a multiplexer, and
a few additional gates. Its size is proportional to ⌈log_2 m⌉. The circuits to supply inputs to
the remaining registers, namely MAR, MDR, OPC, INR, and OUTR, are less complex to
design than those for the accumulator. The same observations apply to the control variable to
write the contents of the memory. The complete design of the CPU is given as an exercise (see
Problem 3.41).

3.10.6  CPU  Circuit  Size  and  Depth 

Using  the  design  given  above  for  a  simple  CPU  as  a  basis,  we  derive  upper  bounds  on  the  size 
and  depth  of  the  next-state  and  output  functions  of  the  RAM  CPU  defined  in  Section  3.4. 

All words on which the CPU operates contain b bits except for addresses, which contain
⌈log m⌉ bits, where m is the number of words in the random-access memory. We assume that
the CPU not only has a ⌈log m⌉-bit program counter but can send the contents of the PC
to the MAR of the random-access memory in one unit of time. When the CPU fetches an
instruction that refers to an address, it may have to retrieve multiple b-bit words to create a
⌈log m⌉-bit address. We assume the time for such operations is counted in the number T of
steps that the RAM takes for the computation.

The arithmetic operations supported by the RAM CPU include addition and subtraction,
operations realized by circuits with size and depth linear and logarithmic respectively in b, the
length of the accumulator. (See Section 2.7.) The same is true for the logical vector and the
shift operations. (See Section 2.5.1.) Thus, circuits affecting the accumulator (see Fig. 3.39)
have size O(b) and depth O(log b). Circuits affecting the opcode and output registers and
the memory address and data registers are simple and have size O(b) and depth O(log b).
The circuits affecting the program counter not only support transfer of data from the
accumulator to the program counter but also allow the program counter to be incremented. The
latter function can be performed by an adder circuit whose size is O(⌈log m⌉) and depth is
O(log ⌈log m⌉). It follows that

C_Ω(δ_CPU) = O(b + ⌈log m⌉)

D_Ω(δ_CPU) = O(log b + log ⌈log m⌉)


3.10.7  Emulation 

In Section 3.4 we demonstrated that whatever computation can be done by a finite-state
machine can be done by a RAM when the latter has sufficient memory. This universal nature of
the RAM, which is a model for the CPU we have just designed, is emphasized by the problem
of emulation, the simulation of one general-purpose computer by another.

Emulation  of  a  target  CPU  by  a  host  CPU  means  reading  the  instructions  in  a  program 
for  the  target  CPU  and  executing  host  instructions  that  have  the  same  effect  as  the  target 
instructions.  In  Problem  3.44  we  ask  the  reader  to  sketch  a  program  to  emulate  one  CPU 
by  another.  This  is  another  manifestation  of  universality,  this  time  for  unbounded-memory 
RAMs. 

Problems 

MATHEMATICAL  PRELIMINARIES 

3.1 Establish the following identity:

$$\sum_{j=0}^{k} j\,2^{j} = 2\left((k-1)2^{k} + 1\right)$$

3.2 Let p : ℕ ↦ ℕ and q : ℕ ↦ ℕ be polynomial functions on the set ℕ of
nonnegative integers. Show that p(q(n)) is also a polynomial in n.

FINITE-STATE  MACHINES 

3.3  Describe  an  FSM  that  compares  two  binary  numbers  supplied  as  concurrent  streams  of 
bits  in  descending  order  of  importance  and  enters  a  rejecting  state  if  the  first  string  is 
smaller  than  the  second  and  an  accepting  state  otherwise. 

3.4  Describe  an  FSM  that  computes  the  threshold-two  function  on  n  Boolean  inputs  that 
are  supplied  sequentially  to  the  machine. 

3.5 Consider the full-adder function f_FA(x_i, y_i, c_i) = (c_{i+1}, s_i) defined below where +
denotes integer addition:

2c_{i+1} + s_i = x_i + y_i + c_i

Show that the subfunction of f_FA obtained by fixing c_i = 0 and deleting c_{i+1} is the
EXCLUSIVE OR of the variables x_i and y_i.

3.6  It  is  straightforward  to  show  that  every  Moore  FSM  is  a  Mealy  FSM.  Given  a  Mealy 
FSM,  show  how  to  construct  a  Moore  FSM  whose  outputs  for  every  input  sequence  are 
identical  to  those  of  the  Mealy  FSM. 

3.7  Find  a  deterministic  FSM  that  recognizes  the  same  language  as  that  recognized  by  the 
nondeterministic  FSM  of  Fig.  3.8. 

3.8  Write  a  program  in  a  language  of  your  choice  that  writes  the  straight-line  program 
described  in  Fig.  3.3  for  the  FSM  of  Fig.  3.2  realizing  the  EXCLUSIVE  OR  function. 

SHALLOW  FSM  CIRCUITS 

3.9  Develop  a  representation  for  states  in  the  m-word,  b-bit  random-access  memory  so  that 
its  next-state  mappings  form  a  semigroup. 

Hint:  Show  that  the  information  necessary  to  update  the  current  state  can  be  succinctly 
described. 

3.10  Show  that  matrix  multiplication  is  associative. 

SEQUENTIAL  CIRCUITS 

3.11  Show  that  the  circuit  of  Fig.  3.15  computes  the  functions  defined  in  the  tables  of 
Fig.  3.14. 

Hint:  Section  2.2  provides  a  method  to  produce  a  circuit  from  a  tabular  description  of 
a  binary  function. 

3.12  Design  a  sequential  circuit  (an  electronic  lock)  that  enters  an  accepting  state  only  when 
it  receives  some  particular  four-bit  sequence  that  you  specify. 

3.13  Design  a  sequential  circuit  (a  modulo-p  counter)  that  increments  a  binary  number  by 
one  on  each  step  until  it  reaches  the  integer  value  p,  at  which  point  it  resets  its  value  to 
zero.  You  should  assume  that  p  is  not  a  power  of  2. 

3.14 Give an efficient design of an incrementing/decrementing counter, a sequential
circuit that increments or decrements a binary number modulo 2^n. Specify the machine
as an FSM and determine the number of gates in the sequential circuit in terms of n.

RANDOM-ACCESS MACHINES

3.15 Given a straight-line program for a Boolean function, describe the steps taken to
compute it during fetch-and-execute cycles of a RAM. Determine whether jump
instructions are necessary to execute such programs.

3.16 Consulting Theorem 3.4.1, determine whether jump instructions are necessary for all
RAM computations. If not, what advantage accrues to using them?

3.17 Sketch a RAM program using time and space O(n) that recognizes strings of the form
{0^m 1^m | 1 ≤ m ≤ n}.

ASSEMBLY-LANGUAGE PROGRAMMING

3.18 Write an assembly-language program in the language of Fig. 3.18 to subtract two
integers.

3.19 The assembly-language instructions of Fig. 3.18 operate on integers. Show that the
operations AND, OR, and NOT can be realized on Boolean variables with these
instructions. Show also that these operations on vectors can be implemented.

3.20 Write an assembly-language program in the language of Fig. 3.18 to form x^y for
integers x and y.

3.21 Show that the assembly-language instructions CLR R_i, R_i ← R_j, JMP+ N_i, and JMP−
N_i can be realized from the assembly-language instructions INC, DEC, CONTINUE,
R_j JMP+ N_i, and R_j JMP− N_i.


TURING  MACHINES 

3.22  In  a  standard  Turing  machine  the  tape  unit  has  a  left  end  but  extends  indefinitely  to  the 
right.  Show  that  allowing  the  tape  unit  to  be  infinite  in  both  directions  does  not  add 
power  to  the  Turing  machine. 

3.23 Describe in detail a Turing machine with unlimited storage capacity that recognizes the
language {0^m 1^m | 1 ≤ m}.

3.24 Sketch a proof that in O(n^4) steps a Turing machine can verify that a particular tour
of n cities in an instance of the Traveling Salesperson Problem satisfies the requirement
that the total distance traveled is less than or equal to the limit k set on this instance of
the Traveling Salesperson Problem.

3.25  Design  the  additional  circuitry  needed  to  transform  a  sequential  circuit  for  a  random- 
access  memory  into  one  for  a  tape  memory.  Give  upper  bounds  on  the  size  and  depth 
of  the  next-state  and  output  functions  that  are  simultaneously  achievable. 

3.26 In the proof of Theorem 3.8.1 it is assumed that the words and their addresses in a
RAM memory unit are placed on the tape of a Turing machine in order of increasing
addresses, as suggested by Fig. 3.40. The addresses, which are ⌈log m⌉ bits in length,
are organized as a collection of ⌈⌈log m⌉/b⌉ b-bit words. (In the example, b = 1.) An
address is written on tape cells that immediately precede the value of the corresponding
RAM word. A RAM address addr is stored on the tape to the left in the shaded region.
Assume that markers can be placed on cells. (This amounts to enlarging the tape
alphabet by a constant factor.) Show that markers can be used to move from the first
word whose RAM address matches the ib most significant bits of the address a to the
next one that matches the (i + 1)b most significant bits. Show that this procedure can
be used to find the RAM word whose address matches addr in O((m/b)(log m)^2)
Turing machine steps by a machine that can store in its control unit only one b-bit
subword of addr.

Figure 3.40 A TM tape with markers on words and the first bit of each address.

3.27  Extend  Problem  3.26  by  demonstrating  that  the  simulation  can  be  done  with  a  binary 
tape  symbol  alphabet. 

3.28  Extend  Theorem  3.8.1  to  show  that  there  exists  a  Turing  machine  that  can  simulate  an 
unbounded-memory  RAM. 

3.29  Sketch  a  proof  that  every  Turing  machine  can  be  simulated  by  a  RAM  program  of  the 
kind  described  in  Section  3.4.3. 

Hint:  Because  such  RAM  programs  can  only  have  a  finite  number  of  registers,  encode 
the  contents  of  the  TM  tape  as  a  number  to  be  stored  in  one  register. 

COMPUTATIONAL  INEQUALITIES  FOR  TURING  MACHINES 

3.30 Show that a one-tape Turing machine needs time exponential in n to compute most
Boolean functions f : B^n ↦ B on n variables, regardless of how much memory is
allocated to the computation.

3.31 Apply Theorem 3.2.2 to the one-tape Turing machine that executes T steps.
Determine whether the resulting inequalities are weaker or stronger than those given in
Theorem 3.9.2.

3.32 Write a program in your favorite language for the procedure WRITE_OR(t, m)
introduced in Fig. 3.27.

3.33 Write a program in your favorite language for the procedure WRITE_CELL_CIRCUIT(t,
m) introduced in Fig. 3.27.

Hint:  See  Problem  2.4. 

FIRST  P-COMPLETE  AND  NP-COMPLETE  PROBLEMS 

3.34 Show that the language MONOTONE CIRCUIT VALUE defined below is P-complete.

MONOTONE CIRCUIT VALUE

Instance: A description for a monotone circuit with fixed values for its input variables
and a designated output gate.

Answer: “Yes” if the output of the circuit has value 1.

Hint: Using dual-rail logic, find a way to translate (reduce) a string in the language
CIRCUIT VALUE to a string in MONOTONE CIRCUIT VALUE by converting in
logarithmic space (in the length of the string) a circuit over the standard basis to a circuit
over the monotone basis. Note that, as stated in the text, the composition of two
logarithmic-space reductions is a logarithmic-space reduction. To simplify the
conversion from non-monotone circuits to monotone circuits, use even integers to index
vertices in the non-monotone circuits so that both even and odd integers can be used
in the monotone case.

3.35 Show that the language FAN-OUT 2 CIRCUIT SAT defined below is NP-complete.

FAN-OUT 2 CIRCUIT SAT

Instance: A description for a circuit of fan-out 2 with free values for its input variables
and a designated output gate.

Answer: “Yes” if the output of the circuit has value 1.

Hint: To reduce the fan-out of a vertex, replace the direct connections between a gate
and its successors by a binary tree whose vertices are AND gates with their inputs
connected together. Show that, for each gate of fan-out more than two, such trees can be
generated by a program that runs in polynomial time.

3.36 Show that clauses given in the proof of Theorem 3.9.7 are satisfied only when their
variables have values consistent with the definition of the gate type.

3.37 A circuit with n input variables {x_1, x_2, ..., x_n} is satisfiable if there is an assignment
of values to the variables such that the output of the circuit has value 1. Assume that
the circuit has only one output and the gates are over the basis Ω = {AND, OR, NOT}.

a) Describe a nondeterministic procedure that accepts as input the description of a
circuit in POSE and returns 1 if the circuit is satisfiable and 0 otherwise.

b) Describe a deterministic procedure that accepts as input the description of a circuit
in POSE and returns 1 if the circuit is satisfiable and 0 otherwise. What is the
running time of this procedure when implemented on the RAM?

c) Describe an efficient (polynomial-time) deterministic procedure that accepts as
input the description of a circuit in SOPE and returns 1 if the circuit is satisfiable
and 0 otherwise.

d) By using Boolean algebra, we can convert a circuit from POSE to SOPE. We can
then use the result of the previous question to determine if the circuit is satisfiable.
What is the drawback of this approach?

CENTRAL PROCESSING UNIT

3.38 Write an assembly-language program to multiply two binary numbers using the
simple CPU of Section 3.10. How large are the integers that can be multiplied without
producing numbers that are too large to be recorded in registers?

3.39 Assume that the simple CPU of Section 3.10 is modified to address an unlimited
number of memory locations. Show that it can realize any Boolean function by
demonstrating that it can compute the Boolean operations AND, OR, and NOT.

3.40 Design a circuit to produce the timing variables t_j, 1 ≤ j ≤ k, of the simple CPU.
They must have the property that exactly one of them has value 1 at a time and they
successively become 1.

Hint: Design a circuit that counts sequentially modulo k, an integer. That is, it
increments a binary number until it reaches k, after which it resets the number to zero. See
Problem 3.13.

3.41 Complete the design of the CPU of Section 3.10 by describing circuits for PC, MAR,
MDR, OPC, INR, and OUTR.

3.42 Show that an indirect store operation can be simulated by the computer of Section 3.10.

Hint: Construct a program that temporarily moves the value of AC aside, fetches the
address containing the destination for the store, and uses Boolean operations to modify
a STA instruction in the program so that it contains the destination address.

3.43  Write  an  assembly-language  program  that  repeatedly  examines  the  input  register  until 
it  is  nonzero  and  then  moves  its  contents  to  the  accumulator. 

3.44  Sketch  an  assembly-language  program  to  emulate  a  target  CPU  by  a  host  CPU  under 
the  assumption  that  each  CPU’s  instruction  set  supports  indirection.  Provide  a  skeleton 
program  that  reads  an  instruction  from  the  target  instruction  set  and  decides  which  host 
instruction  to  execute.  Also  sketch  the  particular  host  instructions  needed  to  emulate  a 
target  add  instruction  and  a  target  jump-on-zero  instruction. 

Chapter  Notes 

Although the concept of the finite-state machine is fully contained in the Turing machine
model (Section 3.7) introduced in 1936 [338], the finite-state machine did not become a
serious object of study until the 1950s. Mealy [215] and Moore [223] introduced models for
finite-state machines that were shown to be equivalent. The Moore model is used in
Section 3.1. Rabin and Scott [266] introduced the nondeterministic machine, although not
defined in terms of external choice inputs as it is defined here.

The simulation of finite-state machines by logic circuits exhibited in Section 3.1.1 is due
to Savage [285], as is its application to random-access (Section 3.6) and deterministic
Turing machines (Section 3.9.1) [286]. The design of a simple CPU owes much to the early
simple computers but is not tied to any particular architecture. The assembly language of
Section 3.4.3 is borrowed from Smith [312].

The shallow circuits simulating finite-state machines described in Section 3.2 are due to
Ladner and Fischer [186] and the existence of a universal Turing machine, the topic of
Section 3.7, was shown by Turing [338].

Cook [74] identified the first NP-complete problem and Karp [159] demonstrated that a
large number of other problems are NP-complete, including the Traveling Salesperson
problem. About this time Levin [199] (see also [335]) was led to similar concepts for combinatorial
problems. Our construction in Section 3.9.1 of a satisfiable circuit follows the general
outline given by Papadimitriou [235] (who also gives the reduction to SATISFIABILITY) as well
as the construction of a circuit simulating a deterministic Turing machine given by Savage
[286]. Cook also identified the first P-complete problem [75,79]. Ladner [185] observed
that the circuit of Theorem 3.9.1 could be written by a program using logarithmic space,
thereby showing that CIRCUIT VALUE is P-complete. More information on P-complete and
NP-complete problems can be found in Chapter 8.

The more sophisticated simulation of a circuit by a Turing machine given in Section 3.9.7
is due to Pippenger and Fischer [252] with improvements by Schnorr [301] and Savage, as
cited by Schnorr.
Finite-State Machines and Pushdown Automata


The  finite-state  machine  (FSM)  and  the  pushdown  automaton  (PDA)  enjoy  a  special  place  in 
computer  science.  The  FSM  has  proven  to  be  a  very  useful  model  for  many  practical  tasks  and 
deserves  to  be  among  the  tools  of  every  practicing  computer  scientist.  Many  simple  tasks,  such 
as  interpreting  the  commands  typed  into  a  keyboard  or  running  a  calculator,  can  be  modeled 
by  finite-state  machines.  The  PDA  is  a  model  to  which  one  appeals  when  writing  compilers 
because  it  captures  the  essential  architectural  features  needed  to  parse  context-free  languages, 
languages  whose  structure  most  closely  resembles  that  of  many  programming  languages. 

In  this  chapter  we  examine  the  language  recognition  capability  of  FSMs  and  PDAs.  We 
show that FSMs recognize exactly the regular languages, languages defined by regular expressions
and generated by regular grammars. We also provide an algorithm to find an FSM that is
equivalent to a given FSM but has the fewest states.

We examine language recognition by PDAs and show that PDAs recognize exactly the
context-free languages, languages whose grammars satisfy less stringent requirements than regular
grammars. Both regular and context-free grammar types are special cases of the
phrase-structure grammars, which are shown in Chapter 5 to generate exactly the languages
accepted by Turing machines.

It is desirable not only to classify languages by the architecture of machines that recognize
them but also to have tests to show that a language is not of a particular type. For this
reason  we  establish  so-called  pumping  lemmas  whose  purpose  is  to  show  how  strings  in  one 
language  can  be  elongated  or  “pumped  up.”  Pumping  up  may  reveal  that  a  language  does  not 
fall  into  a  presumed  language  category.  We  also  develop  other  properties  of  languages  that 
provide  mechanisms  for  distinguishing  among  language  types.  Because  of  the  importance  of 
context-free  languages,  we  examine  how  they  are  parsed,  a  key  step  in  programming  language 
translation. 
4.1 Finite-State Machine Models

The  deterministic  finite-state  machine  (DFSM),  introduced  in  Section  3.1,  has  a  set  of  states, 
including  an  initial  state  and  one  or  more  final  states.  At  each  unit  of  time  a  DFSM  is  given 
a  letter  from  its  input  alphabet.  This  causes  the  machine  to  move  from  its  current  state  to  a 
potentially  new  state.  While  in  a  state,  the  DFSM  produces  a  letter  from  its  output  alphabet. 
Such  a  machine  computes  the  function  defined  by  the  mapping  from  strings  of  input  letters 
to  strings  of  output  letters.  DFSMs  can  also  be  used  to  accept  strings.  A  string  is  accepted 
by  a  DFSM  if  the  last  state  entered  by  the  machine  on  that  input  string  is  a  final  state.  The 
language  recognized  by  a  DFSM  is  the  set  of  strings  that  it  accepts. 

Although  there  are  languages  that  cannot  be  accepted  by  any  machine  with  a  finite  number 
of  states,  it  is  important  to  note  that  all  realistic  computational  problems  are  finite  in  nature 
and  can  be  solved  by  FSMs.  However,  important  opportunities  to  simplify  computations  may 
be  missed  if  we  do  not  view  them  as  requiring  potentially  infinite  storage,  such  as  that  provided 
by  pushdown  automata,  machines  that  store  data  on  a  pushdown  stack.  (Pushdown  automata 
are  formally  introduced  in  Section  4.8.) 

The  nondeterministic  finite-state  machine  (NFSM)  was  also  introduced  in  Section  3.1. 
The  NFSM  has  the  property  that  for  a  given  state  and  input  letter  there  may  be  several  states 
to  which  it  could  move.  Also  for  some  state  and  input  letter  there  may  be  no  possible  move.  We 
say  that  an  NFSM  accepts  a  string  if  there  is  a  sequence  of  next-state  choices  (see  Section  3.1.5) 
that  can  be  made,  when  necessary,  so  that  the  string  causes  the  NFSM  to  enter  a  final  state. 
The  language  accepted  by  such  a  machine  is  the  set  of  strings  it  accepts. 

Although nondeterminism is a useful tool in describing languages and computations,
nondeterministic computations are very expensive to simulate deterministically: the deterministic
simulation time can grow as an exponential function of the nondeterministic computation
time. We explore nondeterminism here to gain experience with it. This will be useful in
Chapter 8 when we classify languages by the ability of nondeterministic machines of infinite
storage capacity to accept them. However, as we shall see, nondeterminism offers no advantage
for finite-state machines in that both DFSMs and NFSMs recognize the same set of
languages.

We now begin our formal treatment of these machine models. Since this chapter is concerned
only with language recognition, we give an abbreviated definition of the deterministic
FSM  that  ignores  the  output  function.  We  also  give  a  formal  definition  of  the  nondeterministic 
finite-state  machine  that  agrees  with  that  given  in  Section  3.1.5.  We  recall  that  we  interpreted 
such  a  machine  as  a  deterministic  FSM  that  possesses  a  choice  input  through  which  a  choice 
agent  specifies  the  state  transition  to  take  if  more  than  one  is  possible. 

DEFINITION 4.1.1 A deterministic finite-state machine (DFSM) $M$ is a five-tuple $M =
(\Sigma, Q, \delta, s, F)$ where $\Sigma$ is the input alphabet, $Q$ is the finite set of states, $\delta : Q \times \Sigma \mapsto Q$ is
the next-state function, $s$ is the initial state, and $F$ is the set of final states. The DFSM $M$
accepts the input string $w \in \Sigma^*$ if the last state entered by $M$ on application of $w$ starting in
state $s$ is a member of the set $F$. $M$ recognizes the language $L(M)$ consisting of all such strings.

A nondeterministic FSM (NFSM) is similarly defined except that the next-state function $\delta$
is replaced by a next-set function $\delta : Q \times \Sigma \mapsto 2^Q$ that associates a set of states with each
state-input pair $(q, a)$. The NFSM $M$ accepts the string $w \in \Sigma^*$ if there are next-state choices,
whenever more than one exists, such that the last state entered under the input string $w$ is a member
of $F$. $M$ accepts the language $L(M)$ consisting of all such strings.
Figure 4.1 The deterministic finite-state machine $M_{\text{odd/even}}$ that accepts strings containing
an odd number of 0’s and an even number of 1’s.


Figure 4.1 shows a DFSM $M_{\text{odd/even}}$ with initial state $q_0$. The final state is shown as
a shaded circle; that is, $F = \{q_2\}$. $M_{\text{odd/even}}$ is in state $q_0$ or $q_2$ as long as the number
of 1’s in its input is even and is in state $q_1$ or $q_3$ as long as the number of 1’s in its input is
odd. Similarly, $M_{\text{odd/even}}$ is in state $q_0$ or $q_1$ as long as the number of 0’s in its input is even
and is in state $q_2$ or $q_3$ as long as the number of 0’s in its input is odd. Thus, $M_{\text{odd/even}}$
recognizes the language of binary strings containing an odd number of 0’s and an even number
of 1’s.
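The behaviour described above can be checked with a short simulation. The following Python sketch is illustrative only: the transition table is inferred from the description (an input 0 toggles the parity of 0’s, an input 1 toggles the parity of 1’s) rather than read from Fig. 4.1, and the names `DELTA` and `accepts` are assumptions introduced here.

```python
# Transition table for M_odd/even, inferred from the text (an assumption):
#   q0: even 0's, even 1's    q1: even 0's, odd 1's
#   q2: odd 0's,  even 1's    q3: odd 0's,  odd 1's
DELTA = {
    ("q0", "0"): "q2", ("q0", "1"): "q1",
    ("q1", "0"): "q3", ("q1", "1"): "q0",
    ("q2", "0"): "q0", ("q2", "1"): "q3",
    ("q3", "0"): "q1", ("q3", "1"): "q2",
}
START, FINAL = "q0", {"q2"}

def accepts(w):
    """Return True if the last state entered on input w is a final state."""
    state = START
    for letter in w:
        state = DELTA[(state, letter)]
    return state in FINAL
```

For instance, `accepts("011")` follows $q_0 \to q_2 \to q_3 \to q_2$ and ends in the final state, whereas `accepts("01")` ends in $q_3$ and is rejected.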

When the next-set function $\delta$ for an NFSM has value $\delta(q, a) = \emptyset$, the empty set, for
state-input pair $(q, a)$, no transition is specified from state $q$ on input letter $a$.

Figure 4.2 shows a simple NFSM $N_D$ with initial state $q_0$ and final state set $F = \{q_0,
q_3, q_5\}$. Nondeterministic transitions are possible from states $q_0$, $q_3$, and $q_5$. In addition, no
transition is specified on input 0 from states $q_1$ and $q_2$ nor on input 1 from states $q_0$, $q_3$, $q_4$,
or $q_5$.


Figure 4.2 The nondeterministic machine $N_D$.
4.2 Equivalence of DFSMs and NFSMs

Finite-state machines recognizing the same language are said to be equivalent. We now show
that the class of languages accepted by DFSMs and NFSMs is the same. That is, for each
NFSM there is an equivalent DFSM and vice versa. The proof has two symmetrical steps: a)
given an arbitrary DFSM $D_1$ recognizing the language $L(D_1)$, we construct an NFSM $N_1$
that accepts $L(D_1)$, and b) given an arbitrary NFSM $N_2$ that accepts $L(N_2)$, we construct a
DFSM $D_2$ that recognizes $L(N_2)$. The first half of this proof follows immediately from the
fact that a DFSM is itself an NFSM. The second half of the proof is a bit more difficult and
is stated below as a theorem. The method of proof is quite simple, however. We construct a
DFSM $D_2$ that has one state for each set of states that the NFSM $N_2$ can reach on some input
string and exhibit a next-state function for $D_2$. We illustrate this approach with the NFSM
$N_2 = N_D$ of Fig. 4.2.

Since the initial state of $N_D$ is $q_0$, the initial state of $D_2 = M_{\text{equiv}}$, the DFSM equivalent
to $N_D$, is the set $\{q_0\}$. In turn, because $q_0$ has two successor states on input 0, namely $q_1$ and
$q_2$, we let $\{q_1, q_2\}$ be the successor to $\{q_0\}$ in $M_{\text{equiv}}$ on input 0, as shown in the following
table. Since $q_0$ has no successor on input 1, the successor to $\{q_0\}$ on input 1 is the empty set $\emptyset$.
Building in this fashion, we find that the successor to $\{q_1, q_2\}$ on input 1 is $\{q_3, q_4\}$ whereas
its successor on input 0 is $\emptyset$. The reader can complete the table shown below. Here $q_{\text{equiv}}$ is
the name of a state of the DFSM $M_{\text{equiv}}$.


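$q_{\text{equiv}}$          $a$    $\delta_{M_{\text{equiv}}}(q_{\text{equiv}}, a)$
$\{q_0\}$                   0      $\{q_1, q_2\}$
$\{q_0\}$                   1      $\emptyset$
$\{q_1, q_2\}$              0      $\emptyset$
$\{q_1, q_2\}$              1      $\{q_3, q_4\}$
$\{q_3, q_4\}$              0      $\{q_1, q_2, q_5\}$
$\{q_3, q_4\}$              1      $\emptyset$
$\{q_1, q_2, q_5\}$         0      $\{q_1, q_2\}$
$\{q_1, q_2, q_5\}$         1      $\{q_3, q_4\}$

$q_{\text{equiv}}$          New label
$\{q_0\}$                   $a$
$\{q_1, q_2\}$              $b$
$\{q_3, q_4\}$              $c$
$\{q_1, q_2, q_5\}$         $d$
$\emptyset$                 $q_R$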

In the second table above, we provide a new label for each state $q_{\text{equiv}}$ of $M_{\text{equiv}}$. In
Fig. 4.3 we use these new labels to exhibit the DFSM $M_{\text{equiv}}$ equivalent to the NFSM $N_D$ of
Fig. 4.2. A final state of $M_{\text{equiv}}$ is any set containing a final state of $N_D$ because a string takes
$M_{\text{equiv}}$ to such a set if and only if it can take $N_D$ to one of its final states. We now show that
this method of constructing a DFSM from an NFSM always works.

THEOREM 4.2.1 Let $L$ be a language accepted by a nondeterministic finite-state machine $M_1$.
There exists a deterministic finite-state machine $M_2$ that recognizes $L$.

Proof Let $M_1 = (\Sigma, Q_1, \delta_1, s_1, F_1)$ be an NFSM that accepts the language $L$. We design
a DFSM $M_2 = (\Sigma, Q_2, \delta_2, s_2, F_2)$ that also recognizes $L$. $M_1$ and $M_2$ have identical input
alphabets, $\Sigma$. The states of $M_2$ are associated with subsets of the states of $Q_1$, which is
denoted by $Q_2 \subseteq 2^{Q_1}$, where $2^{Q_1}$ is the power set of $Q_1$ containing all the subsets of $Q_1$,
including the empty set. We let the initial state $s_2$ of $M_2$ be associated with the set $\{s_1\}$
containing the initial state of $M_1$. A state of $M_2$ is a set of states that $M_1$ can reach on a
sequence of inputs. A final state of $M_2$ is a subset of $Q_1$ that contains a final state of $M_1$.
For example, if $q_5 \in F_1$, then $\{q_2, q_5\} \in F_2$.
Figure 4.3 The DFSM $M_{\text{equiv}}$ equivalent to the NFSM $N_D$.


We first give an inductive definition of the states of $M_2$. Let $Q_2^{(k)}$ denote the sets of states
of $M_1$ that can be reached from $s_1$ on input strings containing $k$ or fewer letters. In the
example given above, $Q_2^{(1)} = \{\{q_0\}, \{q_1, q_2\}, q_R\}$ and $Q_2^{(3)} = \{\{q_0\}, \{q_1, q_2\}, \{q_3, q_4\},
\{q_1, q_2, q_5\}, q_R\}$. To construct $Q_2^{(k+1)}$ from $Q_2^{(k)}$, we form the subset of $Q_1$ that can be
reached on each input letter from a subset in $Q_2^{(k)}$, as illustrated above. If this is a new set,
it is added to $Q_2^{(k)}$ to form $Q_2^{(k+1)}$. When $Q_2^{(k)}$ and $Q_2^{(k+1)}$ are the same, we terminate
this process since no new subsets of $Q_1$ can be reached from $s_1$. This process eventually
terminates because $Q_2$ has at most $2^{|Q_1|}$ elements. It terminates in at most $2^{|Q_1|} - 1$ steps
because starting from the initial set $\{s_1\}$ at least one new subset must be added at each step.

The next-state function $\delta_2$ of $M_2$ is defined as follows: for each state $q$ of $M_2$ (a subset
of $Q_1$), the value of $\delta_2(q, a)$ for input letter $a$ is the state of $M_2$ (subset of $Q_1$) reached from
$q$ on input $a$. As the sets $Q_2^{(1)}, \ldots, Q_2^{(m)}$ are constructed, $m \leq 2^{|Q_1|} - 1$, we construct a
table for $\delta_2$.

We now show by induction on the length of an input string $z$ that if $z$ can take $M_1$ to
a state in the set $S \subseteq Q_1$, then it takes $M_2$ to its state associated with $S$. It follows that if $S$
contains a final state of $M_1$, then $z$ is accepted by both $M_1$ and $M_2$.

The basis for the inductive hypothesis is the case of the empty input letter. In this case,
$s_1$ is reached by $M_1$ if and only if $\{s_1\}$ is reached by $M_2$. The inductive hypothesis is that
if $w$ of length $n$ can take $M_1$ to a state in the set $S$, then it takes $M_2$ to its state associated
with $S$. We assume the hypothesis is true on inputs of length $n$ and show that it remains
true on inputs of length $n + 1$. Let $z = wa$ be an input string of length $n + 1$. To show
that $z$ can take $M_1$ to a state in $S'$ if and only if it takes $M_2$ to the state associated with $S'$,
observe that by the inductive hypothesis there exists a set $S \subseteq Q_1$ such that $w$ can take $M_1$
to a state in $S$ if and only if it takes $M_2$ to the state associated with $S$. By the definition
of $\delta_2$, the input letter $a$ takes the states of $M_1$ in $S$ into states of $M_1$ in $S'$ if and only if $a$
takes the state of $M_2$ associated with $S$ to the state associated with $S'$. It follows that the
inductive hypothesis holds. ■
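The construction used in this proof can be sketched in a few lines of Python. The sketch below is not taken from the text: the function name `subset_construction` is hypothetical, and the edge-level next-set function `ND_DELTA` is an assumption, since only the subset-level table for $N_D$ is given above. The edges were chosen to agree with that table and with the description of Fig. 4.2.

```python
from collections import deque

# Hypothetical next-set function for N_D (an assumption consistent with the
# subset table in the text; the individual edges of Fig. 4.2 are not given).
ND_DELTA = {
    ("q0", "0"): {"q1", "q2"},
    ("q1", "1"): {"q3"},
    ("q2", "1"): {"q4"},
    ("q3", "0"): {"q1", "q5"},
    ("q4", "0"): {"q2"},
    ("q5", "0"): {"q1", "q2"},
}
ND_START, ND_FINAL = "q0", {"q0", "q3", "q5"}

def subset_construction(delta, start, finals, alphabet):
    """Build the DFSM whose states are the subsets reachable from {start}."""
    initial = frozenset({start})
    table, seen, queue = {}, {initial}, deque([initial])
    while queue:
        subset = queue.popleft()
        for a in alphabet:
            # Union of the next-sets of every NFSM state in the current subset.
            target = frozenset(t for q in subset for t in delta.get((q, a), ()))
            table[(subset, a)] = target
            if target not in seen:      # a new subset: one more DFSM state
                seen.add(target)
                queue.append(target)
    # A DFSM state is final if it contains a final state of the NFSM.
    dfsm_finals = {s for s in seen if s & finals}
    return seen, table, initial, dfsm_finals
```

Under these assumptions, calling `subset_construction(ND_DELTA, ND_START, ND_FINAL, "01")` produces five DFSM states, namely $\{q_0\}$, $\{q_1, q_2\}$, $\{q_3, q_4\}$, $\{q_1, q_2, q_5\}$, and the empty set, which plays the role of the reject state $q_R$.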

Up  to  this  point  we  have  shown  equivalence  between  deterministic  and  nondeterministic 
FSMs.  Another  equivalence  question  arises  in  this  context:  It  is,  “Given  an  FSM,  is  there  an 
equivalent  FSM  that  has  a  smaller  number  of  states?”  The  determination  of  an  equivalent  FSM 
with the smallest number of states is called the state minimization problem and is explored
in  Section  4.7. 

4.3  Regular  Expressions 

In  this  section  we  introduce  regular  expressions,  algebraic  expressions  over  sets  of  individual 
letters  that  describe  the  class  of  languages  recognized  by  finite-state  machines,  as  shown  in  the 
next  section. 

Regular expressions are formed through the concatenation, union, and Kleene closure of
sets of strings. Given two sets of strings $L_1$ and $L_2$, their concatenation $L_1 \cdot L_2$ is the set
$\{uv \mid u \in L_1 \text{ and } v \in L_2\}$; that is, the set of strings consisting of an arbitrary string of $L_1$
followed by an arbitrary string of $L_2$. (We often omit the concatenation operator $\cdot$, writing
variables one after the other instead.) The union of $L_1$ and $L_2$, denoted $L_1 \cup L_2$, is the set
of strings that are in $L_1$ or $L_2$ or both. The Kleene closure of a set $L$ of strings, denoted $L^*$
(also called the Kleene star), is defined in terms of the $i$-fold concatenation of $L$ with itself,
namely, $L^i = L \cdot L^{i-1}$, where $L^0 = \{\epsilon\}$, the set containing the empty string:

$$L^* = \bigcup_{i=0}^{\infty} L^i$$

Thus, $L^*$ is the union of strings formed by concatenating zero or more words of $L$. Finally, we
define the positive closure of $L$ to be the union of all $i$-fold products except for the zeroth,
that is,

$$L^+ = \bigcup_{i=1}^{\infty} L^i$$

The  positive  closure  is  a  useful  shorthand  in  regular  expressions. 

An example is helpful. Let $L_1 = \{01, 11\}$ and $L_2 = \{0, aba\}$; then $L_1 L_2 = \{010, 01aba,
110, 11aba\}$, $L_1 \cup L_2 = \{0, 01, 11, aba\}$, and

$$L_2^* = \{0, aba\}^* = \{\epsilon, 0, aba, 00, 0aba, aba0, abaaba, \ldots\}$$
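These set operations translate directly into code. The following Python sketch reproduces the example; the helper names `concat` and `kleene_upto` are assumptions introduced here, and because $L^*$ is infinite whenever $L$ contains a non-empty string, the closure is computed only up to a chosen maximum length.

```python
def concat(L1, L2):
    """Concatenation: every string of L1 followed by every string of L2."""
    return {u + v for u in L1 for v in L2}

def kleene_upto(L, max_len):
    """The strings of L* that have at most max_len letters."""
    result, frontier = {""}, {""}          # L^0 contains only the empty string
    while frontier:
        # Extend the newest strings by one more word of L, keeping short ones.
        frontier = {w for w in concat(frontier, L) if len(w) <= max_len} - result
        result |= frontier
    return result

L1, L2 = {"01", "11"}, {"0", "aba"}
```

Here `concat(L1, L2)` yields the four strings listed above, `L1 | L2` is the union, and `kleene_upto(L2, 6)` contains, among others, the empty string, `0aba`, `aba0`, and `abaaba`.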

Note that the definition given earlier for $\Sigma^*$, namely, the set of strings over the finite alphabet
$\Sigma$, coincides with this new definition of the Kleene closure. We are now prepared to define
regular expressions.

DEFINITION 4.3.1 Regular expressions over the finite alphabet $\Sigma$ and the languages they describe
are defined recursively as follows:

1. $\emptyset$ is a regular expression denoting the empty set.

2. $\epsilon$ is a regular expression denoting the set $\{\epsilon\}$.

3. For each letter $a \in \Sigma$, $a$ is a regular expression denoting the set $\{a\}$ containing $a$.

4. If $r$ and $s$ are regular expressions denoting the languages $R$ and $S$, then $(rs)$, $(r + s)$, and
$(r^*)$ are regular expressions denoting the languages $R \cdot S$, $R \cup S$, and $R^*$, respectively.

The  languages  denoted  by  regular  expressions  are  called  regular  languages.  (They  are  also  often 
called  regular  sets.) 
Figure 4.4 A finite-state machine computing the EXCLUSIVE OR of its inputs.


Some examples of regular expressions will clarify the definitions. The regular expression
(0 + 1)* denotes the set of all strings over the alphabet {0, 1}. The expression (0*)(1)
denotes the strings consisting of zero or more 0’s followed by a single 1. The expression
((1)(0*)(1) + 0)* denotes strings containing an even number of 1’s. Thus, the expression
((0*)(1))((1)(0*)(1) + 0)* denotes strings containing an odd number of 1’s. This is exactly
the class of strings recognized by the simple DFSM in Fig. 4.4. (So far we have set in boldface
all regular expressions denoting sets containing letters. Since context will distinguish between
a set containing a letter and the letter itself, we drop the boldface notation at this point.)

Some  parentheses  in  regular  expressions  can  be  omitted  if  we  give  highest  precedence  to 
Kleene  closure,  next  highest  precedence  to  concatenation,  and  lowest  precedence  to  union.  For 
example, we can write ((0*)(1))((1)(0*)(1) + 0)* as 0*1(10*1 + 0)*.
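The claim that 0*1(10*1 + 0)* describes the strings with an odd number of 1’s can be spot-checked mechanically. The sketch below uses Python's standard `re` module, which writes union as `|` instead of `+`; the names `ODD_ONES` and `has_odd_ones` are assumptions introduced for illustration, and a check on short strings is evidence, not a proof.

```python
import re
from itertools import product

# 0*1(10*1 + 0)* in Python's syntax, where union is written "|".
ODD_ONES = re.compile(r"0*1(10*1|0)*")

def has_odd_ones(w):
    """True if the whole string w is described by the regular expression."""
    return ODD_ONES.fullmatch(w) is not None

def agrees_with_parity(max_len):
    """Compare the expression with a direct count on all short binary strings."""
    for n in range(max_len + 1):
        for letters in product("01", repeat=n):
            w = "".join(letters)
            if has_odd_ones(w) != (w.count("1") % 2 == 1):
                return False
    return True
```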

Because regular expressions denote languages, certain combinations of union, concatenation,
and Kleene closure operations on regular expressions can be rewritten as other combinations
of operations. A regular expression will be treated as identical to the language it denotes.
Two  regular  expressions  are  equivalent  if  they  denote  the  same  language.  We  now  state 
properties  of  regular  expressions,  leaving  their  proof  to  the  reader. 

THEOREM 4.3.1 Let $\emptyset$ and $\epsilon$ be the regular expressions denoting the empty set and the set containing
the empty string and let $r$, $s$, and $t$ be arbitrary regular expressions. Then the rules shown in
Fig. 4.5 hold.

We illustrate these rules with the following example. Let $a = 0^*1 \cdot b + 0^*$, where $b = c \cdot 10^+$
and $c = (0 + 10^+1)^*$. Using rule (16) of Fig. 4.5, we rewrite $c$ as follows:

$$c = (0 + 10^+1)^* = (0^*10^+1)^*0^*$$

Then using rule (15) with $r = 0^*10^+$ and $s = 1$, we write $b$ as follows:

$$b = (0^*10^+1)^*0^*10^+ = (rs)^*r = r(sr)^* = 0^*10^+(10^*10^+)^*$$

It follows that $a$ satisfies

$$\begin{aligned}
a &= 0^*1 \cdot b + 0^* \\
  &= 0^*10^*10^+(10^*10^+)^* + 0^* \\
  &= 0^*(10^*10^+)^+ + 0^* \\
  &= 0^*((10^*10^+)^+ + \epsilon) \\
  &= 0^*(10^*10^+)^*
\end{aligned}$$
(1)   $r\emptyset = \emptyset r = \emptyset$
(2)   $r\epsilon = \epsilon r = r$
(3)   $r + \emptyset = \emptyset + r = r$
(4)   $r + r = r$
(5)   $r + s = s + r$
(6)   $r(s + t) = rs + rt$
(7)   $(r + s)t = rt + st$
(8)   $r(st) = (rs)t$
(9)   $\emptyset^* = \epsilon$
(10)  $\epsilon^* = \epsilon$
(11)  $(\epsilon + r)^+ = r^*$
(12)  $(\epsilon + r)^* = r^*$
(13)  $r^*(\epsilon + r) = (\epsilon + r)r^* = r^*$
(14)  $r^*s + s = r^*s$
(15)  $r(sr)^* = (rs)^*r$
(16)  $(r + s)^* = (r^*s)^*r^*$

Figure 4.5 Rules that apply to regular expressions.


where  we  have  simplified  the  expressions  using  the  definition  of  the  positive  closure,  namely 
r(r*)  =  r+  in  the  second  equation  and  rules  (6),  (5),  and  (12)  in  the  last  three  equations. 
Other  examples  of  the  use  of  the  identities  can  be  found  in  Section  4.4. 
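A simplification such as the one above can be sanity-checked by comparing the original and simplified expressions on all short strings. The Python sketch below does this with the standard `re` module; the names `ORIGINAL`, `SIMPLIFIED`, and `agree_up_to` are assumptions introduced here, union is written `|` as Python requires, and agreement on short strings supports the derivation without replacing it.

```python
import re
from itertools import product

# a = 0*1.b + 0* with b = c.10^+ and c = (0 + 10^+1)*, before simplification.
ORIGINAL = re.compile(r"0*1(0|10+1)*10+|0*")
# The simplified form 0*(10*10^+)* obtained from the rules of Fig. 4.5.
SIMPLIFIED = re.compile(r"0*(10*10+)*")

def agree_up_to(max_len):
    """True if both expressions describe the same binary strings up to max_len."""
    for n in range(max_len + 1):
        for letters in product("01", repeat=n):
            w = "".join(letters)
            in_original = ORIGINAL.fullmatch(w) is not None
            in_simplified = SIMPLIFIED.fullmatch(w) is not None
            if in_original != in_simplified:
                return False
    return True
```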

4.4  Regular  Expressions  and  FSMs 

Regular  languages  are  exactly  the  languages  recognized  by  finite-state  machines,  as  we  now 
show.  Our  two-part  proof  begins  by  showing  (Section  4.4.1)  that  every  regular  language  can 
be  accepted  by  a  nondeterministic  finite-state  machine.  This  is  followed  in  Section  4.4.2  by 
a  proof  that  the  language  recognized  by  an  arbitrary  deterministic  finite-state  machine  can  be 
described by a regular expression. Since by Theorem 4.2.1 the language recognition power of
DFSMs and NFSMs is the same, the desired conclusion follows.

4.4.1 Recognition of Regular Expressions by FSMs

THEOREM 4.4.1 Given a regular expression $r$ over the set $\Sigma$, there is a nondeterministic finite-state
machine that accepts the language denoted by $r$.

Proof We show by induction on the size of a regular expression $r$ (the number of its operators)
that there is an NFSM that accepts the language described by $r$.

BASIS: If no operators are used, the regular expression is either $\epsilon$, $\emptyset$, or $a$ for some $a \in \Sigma$.
The finite-state machines shown in Fig. 4.6 recognize these three languages.
Figure 4.6 Finite-state machines recognizing the regular expressions (a) $\epsilon$, (b) $\emptyset$, and (c) $a$.
In (b) an output state is shown even though it cannot be reached.


INDUCTION:  Assume  that  the  hypothesis  holds  for  all  regular  expressions  r  with  at  most  k 
operators.  We  show  that  it  holds  for  k  +  1  operators.  Since  k  is  arbitrary,  it  holds  for  all  k. 
The  outermost  operator  (the  k  +  1st)  is  either  concatenation,  union,  or  Kleene  closure.  We 
argue  each  case  separately. 

CASE 1: Let $r = (r_1 \cdot r_2)$. Let $M_1$ and $M_2$ be NFSMs that accept $r_1$ and $r_2$, respectively.
By the inductive hypothesis, such machines exist. Without loss of generality, assume that the
states of these machines are distinct and let them have initial states $s_1$ and $s_2$, respectively.
As suggested in Fig. 4.7, create a machine $M$ that accepts $r$ as follows: for each input letter
$a$, final state $f$ of $M_1$, and state $q$ of $M_2$ reached by an edge from $s_2$ labeled $a$, add an edge
with the same label $a$ from $f$ to $q$. If $s_2$ is not a final state of $M_2$, remove the final state
designations from states of $M_1$.

It follows that every string accepted by $M$ either terminates on a final state of $M_1$ (when
$M_2$ accepts the empty string) or exits a final state of $M_1$ (never to return to a state of $M_1$),
enters a state of $M_2$ reachable on one input letter from the initial state of $M_2$, and terminates
on a final state of $M_2$. Thus, $M$ accepts exactly the strings described by $r$.

CASE 2: Let $r = (r_1 + r_2)$. Let $M_1$ and $M_2$ be NFSMs with distinct sets of states and
initial states $s_1$ and $s_2$ that accept $r_1$ and $r_2$, respectively. By the inductive hypothesis, $M_1$ and
$M_2$ exist. As suggested in Fig. 4.8, create a machine $M$ that accepts $r$ as follows: a) add a
new initial state $s_0$; b) for each input letter $a$ and state $q$ of $M_1$ or $M_2$ reached by an edge


Figure 4.7 A machine $M$ recognizing $r_1 \cdot r_2$. $M_1$ and $M_2$ are the NFSMs that accept $r_1$ and
$r_2$, respectively. An edge with label $a$ is added between each final state of $M_1$ and each state of $M_2$
reached on input $a$ from its start state, $s_2$. The final states of $M_2$ are final states of $M$, as are the
final states of $M_1$ if $s_2$ is a final state of $M_2$. It follows that this machine accepts the strings beginning
with a string in $r_1$ followed by one in $r_2$.
Figure 4.8 A machine $M$ accepting $r_1 + r_2$. $M_1$ and $M_2$ are the NFSMs that accept $r_1$ and
$r_2$, respectively. The new start state $s_0$ has an edge labeled $a$ for each edge with this label from the
initial state of $M_1$ or $M_2$. The final states of $M$ are the final states of $M_1$ and $M_2$ as well as $s_0$ if
either $s_1$ or $s_2$ is a final state. After the first input choice, the new machine acts like either $M_1$ or
$M_2$. Therefore, it accepts strings denoted by $r_1 + r_2$.


from $s_1$ or $s_2$ labeled $a$, add an edge with the same label from $s_0$ to $q$. If either $s_1$ or $s_2$ is a
final state, make $s_0$ a final state.

It follows that if either $M_1$ or $M_2$ accepts the empty string, so does $M$. On the first
input letter $M$ enters and remains in either the states of $M_1$ or those of $M_2$. It
follows that it accepts either the strings accepted by $M_1$ or those accepted by $M_2$ (or both),
that is, the union of $r_1$ and $r_2$.

CASE 3: Let $r = (r_1)^*$. Let $M_1$ be an NFSM with initial state $s_1$ that accepts $r_1$, which,
by the inductive hypothesis, exists. Create a new machine $M$, as suggested in Fig. 4.9, as
follows: a) add a new initial state $s_0$; b) for each input letter $a$ and state $q$ reached on $a$ from
$s_1$, add an edge with label $a$ from $s_0$ to state $q$, as in Case 2; c) add such
edges from each final state to these same states. Make the new initial state a final state and
remove the initial-state designation from $s_1$.

It follows that $M$ accepts the empty string, as it should since $r = (r_1)^*$ contains the
empty string. Since the edges leaving each final state are those directed away from the initial
state $s_0$, it follows that $M$ accepts strings that are the concatenation of strings in $r_1$, as it
should. ■
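The three cases of this proof can be sketched as operations on small machine descriptions. The Python below is an illustrative reading of the constructions, not the author's code: the dictionary layout and the helper names (`letter`, `concat`, `union`, `star`, `accepts`) are assumptions made here. As in the proof, no empty-string transitions are used; instead, copies of the edges leaving an initial state are attached to other states.

```python
from itertools import count

_ids = count()  # fresh state names, so that machines never share states

def letter(a):
    """Basis case: an NFSM accepting the one-letter string a."""
    s, f = next(_ids), next(_ids)
    return {"start": s, "finals": {f}, "delta": {(s, a): {f}}}

def _copy_delta(*machines):
    return {key: set(t) for m in machines for key, t in m["delta"].items()}

def _start_edges(m):
    """(letter, targets) pairs for the edges leaving the initial state of m."""
    return [(a, t) for (q, a), t in m["delta"].items() if q == m["start"]]

def _add_edges(delta, sources, edges):
    for q in sources:
        for a, targets in edges:
            delta.setdefault((q, a), set()).update(targets)

def concat(m1, m2):
    """Case 1: final states of m1 receive copies of the edges leaving s2."""
    delta = _copy_delta(m1, m2)
    _add_edges(delta, m1["finals"], _start_edges(m2))
    finals = set(m2["finals"])
    if m2["start"] in m2["finals"]:        # m2 accepts the empty string
        finals |= m1["finals"]
    return {"start": m1["start"], "finals": finals, "delta": delta}

def union(m1, m2):
    """Case 2: a new initial state copies the edges leaving s1 and s2."""
    s0 = next(_ids)
    delta = _copy_delta(m1, m2)
    _add_edges(delta, [s0], _start_edges(m1) + _start_edges(m2))
    finals = m1["finals"] | m2["finals"]
    if m1["start"] in m1["finals"] or m2["start"] in m2["finals"]:
        finals = finals | {s0}
    return {"start": s0, "finals": finals, "delta": delta}

def star(m1):
    """Case 3: a new final initial state; final states restart the machine."""
    s0 = next(_ids)
    delta = _copy_delta(m1)
    _add_edges(delta, [s0, *m1["finals"]], _start_edges(m1))
    return {"start": s0, "finals": m1["finals"] | {s0}, "delta": delta}

def accepts(m, w):
    """Track the set of states the NFSM could occupy after each letter."""
    current = {m["start"]}
    for a in w:
        current = {t for q in current for t in m["delta"].get((q, a), ())}
    return bool(current & m["finals"])
```

For example, `union(concat(letter("1"), star(letter("0"))), letter("0"))` builds a machine for the expression 10* + 0, and `accepts` can then be used to test individual strings against it.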

We now illustrate this construction of an NFSM from a regular expression. Consider the
regular expression $r = 10^* + 0$, which we decompose as $r = (r_1 r_2 + r_3)$ where $r_1 = 1$,
$r_2 = (r_4)^*$, $r_3 = 0$, and $r_4 = 0$. Shown in Fig. 4.10(a) is an NFSM accepting the languages
denoted by the regular expressions $r_3$ and $r_4$, and in (b) is an NFSM accepting $r_1$. Figure 4.11
shows an NFSM accepting the closure of $r_4$ obtained by adding a new initial state (which is
also made a final state) from which is directed a copy of the edge directed away from the initial
Figure 4.9 A machine $M$ accepting $r_1^*$. $M_1$ accepts $r_1$. Make $s_0$ the initial state of $M$. For
each input letter $a$, add an edge labeled $a$ from $s_0$ and each final state of $M_1$ to each state reached on
input $a$ from $s_1$, the initial state of $M_1$. The final states of $M$ are $s_0$ and the final states of $M_1$.
Thus, $M$ accepts $\epsilon$ and all strings formed by the concatenation of strings accepted by $M_1$; that is,
it realizes the closure $r_1^*$.


Figure 4.11 An NFSM accepting the Kleene closure of {0}.


Figure 4.12 A nondeterministic machine accepting 10*.
Figure 4.13 A nondeterministic machine accepting 10* + 0.


state of $M_0$, the machine accepting $r_4$. (The state $s_1$ is marked as inaccessible.) Figure 4.12
shows an NFSM accepting $r_1 r_2$ constructed by concatenating the machine $M_1$
accepting $r_1$ with $M_2$ accepting $r_2$. ($s_1$ is inaccessible.) Figure 4.13 gives an NFSM accepting
the language denoted by $r_1 r_2 + r_3$, designed by forming the union of machines for $r_1 r_2$ and $r_3$.
(States $s_2$ and $s_3$ are inaccessible.) Figure 4.14 shows a DFSM recognizing the same language
as that accepted by the machine in Fig. 4.13. Here we have added a reject state $q_R$ to which all
states move on input letters for which no state transition is defined.

4.4.2  Regular  Expressions  Describing  FSM  Languages 

We  now  give  the  second  part  of  the  proof  of  equivalence  of  FSMs  and  regular  expressions.  We 
show  that  every  language  recognized  by  a  DFSM  can  be  described  by  a  regular  expression.  We 
illustrate  the  proof  using  the  DFSM  of  Fig.  4.3,  which  is  the  DFSM  given  in  Fig.  4.15  except 
for  a  relabeling  of  states. 

THEOREM 4.4.2 If the language $L$ is recognized by a DFSM $M = (\Sigma, Q, \delta, s, F)$, then $L$ can
be represented by a regular expression.


Figure  4.14  A  deterministic  machine  accepting  10*  +  0. 
Figure 4.15 The DFSM of Figure 4.3 with a relabeling of states.


Proof Let $Q = \{q_1, q_2, \ldots, q_n\}$ and let $F = \{q_{j_1}, q_{j_2}, \ldots, q_{j_p}\}$ be the final states. The
proof idea is the following. For every pair of states $(q_i, q_j)$ of $M$ we construct a regular
expression $r_{i,j}^{(0)}$ denoting the set $R_{i,j}^{(0)}$ containing input letters that take $M$ from $q_i$ to $q_j$
without passing through any other states. If $i = j$, $r_{i,j}^{(0)}$ contains the empty letter $\epsilon$ because
$M$ can move from $q_i$ to $q_i$ without reading an input letter. (These definitions are illustrated
in the table of Fig. 4.16.) For $k = 1, 2, \ldots, n$ we proceed to define the set $R_{i,j}^{(k)}$ of
strings that take $M$ from $q_i$ to $q_j$ without passing through any state except possibly one in
$Q^{(k)} = \{q_1, q_2, \ldots, q_k\}$. We also associate a regular expression $r_{i,j}^{(k)}$ with the set $R_{i,j}^{(k)}$. Since
$Q^{(n)} = Q$, the input strings that carry $M$ from $s = q_t$, the initial state, to a final state in $F$
are the strings accepted by $M$. They can be described by the following regular expression:

$$r_{t,j_1}^{(n)} + r_{t,j_2}^{(n)} + \cdots + r_{t,j_p}^{(n)}$$

This method of proof provides a dynamic programming algorithm to construct a regular
expression for $L$.


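$$T^{(0)} = \{r^{(0)}_{i,j}\}$$

$$
\begin{array}{c|ccccc}
i \backslash j & 1 & 2 & 3 & 4 & 5 \\ \hline
1 & \epsilon & 0 & 1 & \emptyset & \emptyset \\
2 & \emptyset & \epsilon & 0 & 1 & \emptyset \\
3 & \emptyset & \emptyset & \epsilon + 0 + 1 & \emptyset & \emptyset \\
4 & \emptyset & \emptyset & 1 & \epsilon & 0 \\
5 & \emptyset & 0 & \emptyset & 1 & \epsilon
\end{array}
$$

Figure 4.16 The table $T^{(0)}$ containing the regular expressions $\{r^{(0)}_{i,j}\}$ associated with the DFSM
shown in Fig. 4.15.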
$R^{(0)}_{i,j}$ is formally defined below.

$$
R^{(0)}_{i,j} =
\begin{cases}
\{a \mid \delta(q_i, a) = q_j\} & \text{if } i \neq j \\
\{a \mid \delta(q_i, a) = q_j\} \cup \{\epsilon\} & \text{if } i = j
\end{cases}
$$


Since $R^{(k)}_{i,j}$ is defined as the set of strings that take $M$ from $q_i$ to $q_j$ without passing through
states outside of $Q^{(k)}$, it can be recursively defined as the strings that take $M$ from $q_i$ to
$q_j$ without passing through states outside of $Q^{(k-1)}$ plus those that take $M$ from $q_i$ to $q_k$
without passing through states outside of $Q^{(k-1)}$, followed by strings that take $M$ from
$q_k$ to $q_k$ zero or more times without passing through states outside $Q^{(k-1)}$, followed by
strings that take $M$ from $q_k$ to $q_j$ without passing through states outside of $Q^{(k-1)}$. This is
represented by the formula below and suggested in Fig. 4.17:

$$R^{(k)}_{i,j} = R^{(k-1)}_{i,j} \cup R^{(k-1)}_{i,k} \cdot \left(R^{(k-1)}_{k,k}\right)^* \cdot R^{(k-1)}_{k,j}$$


It follows by induction on $k$ that $R^{(k)}_{i,j}$ correctly describes the strings that take $M$ from $q_i$ to
$q_j$ without passing through states of index higher than $k$.

We now exhibit the set $\{r^{(k)}_{i,j} \mid 1 \le i, j, k \le n\}$ of regular expressions that describe the sets
$\{R^{(k)}_{i,j} \mid 1 \le i, j, k \le n\}$ and establish the correspondence by induction. If the set $R^{(0)}_{i,j}$ contains the
letters $x_1, x_2, \ldots, x_l$ (which might include the empty letter $\epsilon$), then we let $r^{(0)}_{i,j} = x_1 + x_2 +
\cdots + x_l$. Assume that $r^{(k-1)}_{i,j}$ correctly describes $R^{(k-1)}_{i,j}$. It follows that the regular
expression

$$r^{(k)}_{i,j} = r^{(k-1)}_{i,j} + r^{(k-1)}_{i,k} \left(r^{(k-1)}_{k,k}\right)^* r^{(k-1)}_{k,j} \qquad (4.1)$$

correctly describes $R^{(k)}_{i,j}$. This concludes the proof.
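The dynamic programming construction in the proof can be sketched in Python as follows. This is a minimal illustration, not the book's own code: the function names are hypothetical, states are assumed to be numbered 1 through n, regular expressions are written in the syntax of Python's `re` module (`|` in place of +, the empty string for the empty letter, `None` for the empty set), and the example transitions are those of the DFSM of Fig. 4.15 as read from the table of Fig. 4.16.

```python
import re
from itertools import product

EMPTY = None  # stands for the regular expression denoting the empty set

def union(a, b):
    """a + b, simplified when one side is the empty set or both are equal."""
    if a is EMPTY:
        return b
    if b is EMPTY or a == b:
        return a
    return f"(?:{a}|{b})"

def concat(a, b):
    """ab; concatenating with the empty set gives the empty set."""
    if a is EMPTY or b is EMPTY:
        return EMPTY
    return a + b

def star(a):
    """a*; both the empty set and the empty letter give the empty letter."""
    if a is EMPTY or a == "":
        return ""
    return f"(?:{a})*"

def dfsm_to_regex(n, delta, start, finals):
    """States are 1..n; delta maps (state, letter) to the next state."""
    r = [[EMPTY] * (n + 1) for _ in range(n + 1)]   # r[i][j] = r^(0)_{i,j}
    for (i, a), j in delta.items():
        r[i][j] = union(r[i][j], re.escape(a))
    for i in range(1, n + 1):
        r[i][i] = union(r[i][i], "")                # the empty letter
    for k in range(1, n + 1):                       # apply identity (4.1)
        r = [[EMPTY] * (n + 1)] + [
            [EMPTY] + [
                union(r[i][j], concat(concat(r[i][k], star(r[k][k])), r[k][j]))
                for j in range(1, n + 1)
            ]
            for i in range(1, n + 1)
        ]
    result = EMPTY
    for f in sorted(finals):
        result = union(result, r[start][f])
    return result  # EMPTY (None) if the machine accepts nothing

def accepts(delta, start, finals, w):
    """Run the DFSM on w and report whether it ends in a final state."""
    state = start
    for a in w:
        state = delta[(state, a)]
    return state in finals

# Transitions of the DFSM of Fig. 4.15 read from T^(0); q1 is initial.
FIG_4_15 = {(1, "0"): 2, (1, "1"): 3, (2, "0"): 3, (2, "1"): 4, (3, "0"): 3,
            (3, "1"): 3, (4, "0"): 5, (4, "1"): 3, (5, "0"): 2, (5, "1"): 4}
FINALS = {1, 4, 5}
PATTERN = dfsm_to_regex(5, FIG_4_15, 1, FINALS)
```

The expression returned is usually much longer than a hand-simplified one, because the sketch applies only the most basic simplifications; it should nevertheless denote the same language as the machine.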


Figure 4.17 A recursive decomposition of the set of strings that cause an FSM to move
from state $q_i$ to $q_j$ without passing through states $q_l$ for $l > k$.

The dynamic programming algorithm given in the above proof is illustrated by the DFSM
in Fig. 4.15. Because this algorithm can produce complex regular expressions even for small
DFSMs, we display almost all of its steps, stopping when it is obvious which results are needed
for the regular expression that describes the strings recognized by the DFSM. For $0 \le k \le 5$,
let $T^{(k)}$ denote the table of values of $\{r^{(k)}_{i,j} \mid 1 \le i, j \le 5\}$. Table $T^{(0)}$ in Fig. 4.16 describes
the next-state function of this DFSM. The remaining tables are constructed by invoking the
definition of $r^{(k)}_{i,j}$ in (4.1). Entries in table $T^{(1)}$ are formed using the following facts:


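$$r^{(1)}_{i,j} = r^{(0)}_{i,j} + r^{(0)}_{i,1} \left(r^{(0)}_{1,1}\right)^* r^{(0)}_{1,j}; \qquad \left(r^{(0)}_{1,1}\right)^* = \epsilon^* = \epsilon; \qquad r^{(0)}_{i,1} = \emptyset \text{ for } i \ge 2$$

It follows that $r^{(1)}_{i,j} = r^{(0)}_{i,j}$ or that $T^{(1)}$ is identical to $T^{(0)}$. Invoking the identity $r^{(2)}_{i,j} =
r^{(1)}_{i,j} + r^{(1)}_{i,2} \left(r^{(1)}_{2,2}\right)^* r^{(1)}_{2,j}$ and using $\left(r^{(1)}_{2,2}\right)^* = \epsilon$, we construct the table $T^{(2)}$ below: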


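$$T^{(2)} = \{r^{(2)}_{i,j}\}$$

$$
\begin{array}{c|ccccc}
i \backslash j & 1 & 2 & 3 & 4 & 5 \\ \hline
1 & \epsilon & 0 & 1 + 00 & 01 & \emptyset \\
2 & \emptyset & \epsilon & 0 & 1 & \emptyset \\
3 & \emptyset & \emptyset & \epsilon + 0 + 1 & \emptyset & \emptyset \\
4 & \emptyset & \emptyset & 1 & \epsilon & 0 \\
5 & \emptyset & 0 & 00 & 1 + 01 & \epsilon
\end{array}
$$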

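The fourth table $T^{(3)}$ is shown below. It is constructed using the identity $r^{(3)}_{i,j} = r^{(2)}_{i,j} +
r^{(2)}_{i,3} \left(r^{(2)}_{3,3}\right)^* r^{(2)}_{3,j}$ and the fact that $\left(r^{(2)}_{3,3}\right)^* = (0 + 1)^*$.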


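$$T^{(3)} = \{r^{(3)}_{i,j}\}$$

$$
\begin{array}{c|ccccc}
i \backslash j & 1 & 2 & 3 & 4 & 5 \\ \hline
1 & \epsilon & 0 & (1 + 00)(0 + 1)^* & 01 & \emptyset \\
2 & \emptyset & \epsilon & 0(0 + 1)^* & 1 & \emptyset \\
3 & \emptyset & \emptyset & (0 + 1)^* & \emptyset & \emptyset \\
4 & \emptyset & \emptyset & 1(0 + 1)^* & \epsilon & 0 \\
5 & \emptyset & 0 & 00(0 + 1)^* & 1 + 01 & \epsilon
\end{array}
$$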

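The fifth table $T^{(4)}$ is shown below. It is constructed using the identity $r^{(4)}_{i,j} = r^{(3)}_{i,j} +
r^{(3)}_{i,4} \left(r^{(3)}_{4,4}\right)^* r^{(3)}_{4,j}$ and the fact that $\left(r^{(3)}_{4,4}\right)^* = \epsilon$.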

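$$T^{(4)} = \{r^{(4)}_{i,j}\}$$

$$
\begin{array}{c|ccccc}
i \backslash j & 1 & 2 & 3 & 4 & 5 \\ \hline
1 & \epsilon & 0 & (1 + 00 + 011)(0 + 1)^* & 01 & 010 \\
2 & \emptyset & \epsilon & (0 + 11)(0 + 1)^* & 1 & 10 \\
3 & \emptyset & \emptyset & (0 + 1)^* & \emptyset & \emptyset \\
4 & \emptyset & \emptyset & 1(0 + 1)^* & \epsilon & 0 \\
5 & \emptyset & 0 & (00 + 11 + 011)(0 + 1)^* & 1 + 01 & \epsilon + 10 + 010
\end{array}
$$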
Instead of building the sixth table, $T^{(5)}$, we observe that the regular expression that is
needed is $r = r^{(5)}_{1,1} + r^{(5)}_{1,4} + r^{(5)}_{1,5}$. Since $r^{(5)}_{i,j} = r^{(4)}_{i,j} + r^{(4)}_{i,5} \left(r^{(4)}_{5,5}\right)^* r^{(4)}_{5,j}$ and $\left(r^{(4)}_{5,5}\right)^* =
(10 + 010)^*$, we have the following expressions for $r^{(5)}_{1,1}$, $r^{(5)}_{1,4}$, and $r^{(5)}_{1,5}$:

$$
\begin{aligned}
r^{(5)}_{1,1} &= \epsilon \\
r^{(5)}_{1,4} &= 01 + (010)(10 + 010)^*(1 + 01) \\
r^{(5)}_{1,5} &= 010 + (010)(10 + 010)^*(\epsilon + 10 + 010) = (010)(10 + 010)^*
\end{aligned}
$$


Thus, the DFSM recognizes the language denoted by the regular expression $r = \epsilon + 01 +
(010)(10 + 010)^*(\epsilon + 1 + 01)$. It can be shown that this expression denotes the same language
as does $\epsilon + 01 + (01)(01 + 001)^*(\epsilon + 0) = (01 + 010)^*$. (See Problem 4.12.)
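The claimed equality of these three expressions can be spot-checked by brute force. The sketch below is only a finite test, not a proof: it transcribes the expressions into the syntax of Python's `re` module (an assumption of this illustration, with `|` for + and an empty alternative for the empty letter) and compares the sets of binary strings they match up to a chosen length. The names are hypothetical.

```python
import re
from itertools import product

# The three expressions from the text, transcribed into Python re syntax.
R_DP = r"(?:|01|010(?:10|010)*(?:|1|01))"      # result of the algorithm
R_ALT = r"(?:|01|01(?:01|001)*(?:|0))"         # intermediate simplification
R_STAR = r"(?:01|010)*"                        # final simplified form

def language(pattern, max_len):
    """Binary strings of length at most max_len matched in full by pattern."""
    return {
        "".join(w)
        for n in range(max_len + 1)
        for w in product("01", repeat=n)
        if re.fullmatch(pattern, "".join(w))
    }

def agree(max_len=10):
    """True if all three expressions match the same strings up to max_len."""
    first = language(R_DP, max_len)
    return first == language(R_ALT, max_len) == language(R_STAR, max_len)
```

Calling `agree(10)` compares the three languages on all 2047 binary strings of length at most 10.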


4.4.3  grep — Searching  for  Strings  in  Files 

Many operating systems provide a command to find strings in files. For example, the Unix
grep command prints all lines of a file containing a string specified by a regular expression.
grep is invoked as follows:

grep regular-expression file_name

Thus, the command grep 'o+' file_name returns each line of the file file_name that
contains a string denoted by o+ somewhere in the line. grep is typically implemented with a
nondeterministic algorithm whose behavior can be understood by considering the construction
of the preceding section.

In  Section  4.4.1  we  describe  a  procedure  to  construct  NFSMs  accepting  strings  denoted 
by  regular  expressions.  Each  such  machine  starts  in  its  initial  state  before  processing  an  input 
string.  Since  grep  finds  lines  containing  a  string  that  starts  anywhere  in  the  lines,  these  NFSMs 
have  to  be  modified  to  implement  grep.  The  modifications  required  for  this  purpose  are 
straightforward  and  left  as  an  exercise  for  the  reader.  (See  Problem  4.19.) 
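The behavior described above, matching a pattern that may start anywhere in a line, can be imitated in a few lines of Python. This is a sketch of the observable behavior only, not of how grep is implemented: it relies on `re.search` from Python's standard library, which scans for a match beginning at any position, and the function name `grep` here is merely borrowed for illustration. CPython's `re` module uses backtracking rather than the NFSM simulation discussed in the text.

```python
import re

def grep(pattern, lines):
    """Return the lines that contain a match for pattern anywhere in the line."""
    compiled = re.compile(pattern)
    # search() tries every starting position, unlike fullmatch(), which
    # requires the whole line to be denoted by the expression.
    return [line for line in lines if compiled.search(line)]

SAMPLE = ["foo", "bar", "hello", "xyz"]
MATCHES = grep("o+", SAMPLE)
```

Here `MATCHES` holds the sample lines that contain at least one o.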


4.5  The  Pumping  Lemma  for  FSMs 

It  is  not  surprising  that  some  languages  are  not  regular.  In  this  section  we  provide  machinery 
to  show  this.  It  is  given  in  the  form  of  the  pumping  lemma,  which  demonstrates  that  if  a 
regular  language  contains  long  strings,  it  must  contain  an  infinite  set  of  strings  of  a  particular 
form.  We  show  the  existence  of  languages  that  do  not  contain  strings  of  this  form,  thereby 
demonstrating  that  they  are  not  regular. 

The  pigeonhole  principle  is  used  to  prove  the  pumping  lemma.  It  states  that  if  there  are 
n  pigeonholes  and  n  +  1  pigeons,  each  of  which  occupies  a  hole,  then  at  least  one  hole  has  two 
pigeons.  This  principle,  whose  proof  is  obvious  (see  Section  1.3),  enjoys  a  hallowed  place  in 
combinatorial  mathematics. 

The  pigeonhole  principle  is  applied  as  follows.  We  first  note  that  if  a  regular  language  L 
is  infinite,  it  contains  a  string  w  with  at  least  as  many  letters  as  there  are  states  in  a  DFSM  M 
recognizing  L.  Including  the  initial  state,  it  follows  that  M  visits  at  least  one  more  state  while 
processing  w  than  it  has  different  states.  Thus,  at  least  one  state  is  visited  at  least  twice.  The 
substring  of  w  that  causes  M  to  move  from  this  state  back  to  itself  can  be  repeated  zero  or 
more times to give other strings in the language. We use the notation $u^n$ to mean the string $u$
repeated $n$ times and let $u^0 = \epsilon$.

LEMMA 4.5.1 Let $L$ be a regular language over the alphabet $\Sigma$ recognized by a DFSM with $m$
states. If $w \in L$ and $|w| \ge m$, then there are strings $r$, $s$, and $t$ with $|s| \ge 1$ and $|rs| \le m$
such that $w = rst$ and for all integers $n \ge 0$, $r s^n t$ is also in $L$.

Proof Let $L$ be recognized by the DFSM $M$ with $m$ states. Let $k = |w| \ge m$ be the length
of $w$ in $L$. Let $q_0, q_1, q_2, \ldots, q_k$ denote the initial and $k$ successive states that $M$ enters after
receiving each of the letters in $w$. By the pigeonhole principle, some state $q'$ in the sequence
$q_0, \ldots, q_m$ ($m \le k$) is repeated. Let $q_i = q_j = q'$ for $i < j$. Let $r = w_1 \ldots w_i$ be the
string that takes $M$ from $q_0$ to $q_i = q'$ (this string may be empty) and let $s = w_{i+1} \ldots w_j$
be the string that takes $M$ from $q_i = q'$ to $q_j = q'$ (this string is non-empty). It follows
that $|rs| \le m$. Finally, let $t = w_{j+1} \ldots w_k$ be the string that takes $M$ from $q_j$ to $q_k$. Since
$s$ takes $M$ from state $q'$ to state $q'$, the final state entered by $M$ is the same whether $s$ is
deleted or repeated one or more times. (See Fig. 4.18.) It follows that $r s^n t$ is in $L$ for all
$n \ge 0$. ■
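The decomposition $w = rst$ used in this proof can be computed by running the machine and stopping at the first repeated state. The following Python sketch assumes a DFSM given as a dictionary from (state, letter) pairs to states; the function names are hypothetical, and the example machine is the DFSM of Fig. 4.15 with transitions read from the table of Fig. 4.16.

```python
def run(delta, state, w):
    """Return the state reached from `state` after reading w."""
    for a in w:
        state = delta[(state, a)]
    return state

def pump_split(delta, start, w):
    """Split w into (r, s, t) at the first repeated state, as in the proof."""
    first_seen = {start: 0}   # state -> number of letters read on first visit
    state = start
    for pos, a in enumerate(w, start=1):
        state = delta[(state, a)]
        if state in first_seen:
            i, j = first_seen[state], pos
            return w[:i], w[i:j], w[j:]
        first_seen[state] = pos
    return None  # no repetition; possible only if |w| is less than the state count

# DFSM of Fig. 4.15 (five states, q1 initial, final states q1, q4, q5).
FIG_4_15 = {(1, "0"): 2, (1, "1"): 3, (2, "0"): 3, (2, "1"): 4, (3, "0"): 3,
            (3, "1"): 3, (4, "0"): 5, (4, "1"): 3, (5, "0"): 2, (5, "1"): 4}
FINALS = {1, 4, 5}
R, S, T = pump_split(FIG_4_15, 1, "01010")
```

For the accepted string 01010 the machine first revisits a state after four letters, so the loop string s lies within the first m = 5 letters, as the lemma requires.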

As an application of the pumping lemma, consider the language $L = \{0^p 1^p \mid p \ge 1\}$.
We show that it is not regular. Assume it is regular and is recognized by a DFSM with $m$
states. We show that a contradiction results. Since $L$ is infinite, it contains a string $w$ of length
$k = 2p \ge 2m$, that is, with $p \ge m$. By Lemma 4.5.1 $L$ also contains $r s^n t$, $n \ge 0$, where
$w = rst$ and $|rs| \le m \le p$. That is, $s = 0^d$ where $1 \le d \le p$. Since $r s^n t = 0^{p+(n-1)d} 1^p$ for
$n \ge 0$ and this is not of the form $0^p 1^p$ for $n = 0$ and $n \ge 2$, the language is not regular.
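As a concrete instance of this argument, suppose (purely for illustration; these values are assumptions, not taken from the text) that $m = 3$, $p = 3$, and the decomposition supplied by the lemma has $r = \epsilon$ and $d = 2$:

$$w = 0^3 1^3 = 000111, \qquad r = \epsilon, \quad s = 00, \quad t = 0111$$

$$r s^0 t = 0111 = 0^{1} 1^{3}, \qquad r s^2 t = 00000111 = 0^{5} 1^{3}$$

Both exponents of 0 agree with $p + (n-1)d$ for $n = 0$ and $n = 2$, and neither string has equal numbers of 0's and 1's.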

The  pumping  lemma  allows  us  to  derive  specific  conditions  under  which  a  language  is 
finite  or  infinite,  as  we  now  show. 

LEMMA 4.5.2 Let $L$ be a regular language recognized by a DFSM with $m$ states. $L$ is non-empty
if and only if it contains a string of length less than $m$. It is infinite if and only if it contains a string
of length at least $m$ and at most $2m - 1$.

Proof If $L$ contains a string of length less than $m$, it is not empty. If it is not empty, let $w$
be a shortest string in $L$. This string must have length at most $m - 1$ or we can apply the
pumping lemma to it and find another string of smaller length that is also in $L$. But this
would contradict the assumption that $w$ is a shortest string in $L$. Thus, $L$ contains a string
of length at most $m - 1$.

Figure 4.18 Diagram illustrating the pumping lemma.

If $L$ contains a string $w$ of length $m \le |w| \le 2m - 1$, as shown in the proof of the
pumping lemma, $w$ can be “pumped up” to produce an infinite set of strings. Suppose now
that $L$ is infinite. Either it contains a string $w$ of length $m \le |w| \le 2m - 1$ or it does not.
In the first case, we are done. In the second case, let $w$ be a shortest string in $L$ of length
at least $2m$; such a string exists because $L$ is infinite. Applying the pumping lemma to $w$
with $n = 0$ gives the string $rt$ in $L$, which is shorter than $w$ by $|s| \le m$ letters and therefore
has length at least $m$. Since $L$ has no string of length between $m$ and $2m - 1$, $rt$ has length at
least $2m$, contradicting the hypothesis that $w$ was a shortest string of length greater than or
equal to $2m$. ■
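Lemma 4.5.2 turns the questions of emptiness and infiniteness into finite searches. The sketch below carries out those searches by brute force; the function names are hypothetical, the DFSM is assumed to be given as a transition dictionary with a next state for every state and letter, and the running time is exponential in the number of states m, so this is meant only to make the lemma concrete.

```python
from itertools import product

def accepts(delta, start, finals, w):
    """Run the DFSM on w and report whether it ends in a final state."""
    state = start
    for a in w:
        state = delta[(state, a)]
    return state in finals

def accepts_some(delta, start, finals, alphabet, lo, hi):
    """True if some string w with lo <= |w| <= hi is accepted."""
    return any(
        accepts(delta, start, finals, w)
        for n in range(lo, hi + 1)
        for w in product(alphabet, repeat=n)
    )

def is_nonempty(m, delta, start, finals, alphabet):
    """Lemma 4.5.2: non-empty iff a string of length less than m is accepted."""
    return accepts_some(delta, start, finals, alphabet, 0, m - 1)

def is_infinite(m, delta, start, finals, alphabet):
    """Lemma 4.5.2: infinite iff a string of length m to 2m - 1 is accepted."""
    return accepts_some(delta, start, finals, alphabet, m, 2 * m - 1)

# A three-state machine accepting only the string 0 (state 3 is a reject state).
ONLY_ZERO = {(1, "0"): 2, (1, "1"): 3, (2, "0"): 3, (2, "1"): 3,
             (3, "0"): 3, (3, "1"): 3}
```

For `ONLY_ZERO` with final state 2, the language is non-empty but finite, so the first test succeeds and the second fails.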

4.6  Properties  of  Regular  Languages 

Section 4.4 established the equivalence of regular languages (recognized by finite-state
machines) and the languages denoted by regular expressions. We now present properties satisfied
by regular languages. We say that a class of languages is closed under an operation if
applying that operation to a language (or languages) in the class produces another language in
the class. For example, as shown below, the union of two regular languages is another regular
language. Similarly, the Kleene closure applied to a regular language returns another regular
language.

Given a language $L$ over an alphabet $\Sigma$, the complement of $L$ is the set $\overline{L} = \Sigma^* - L$,
the strings that are in $\Sigma^*$ but not in $L$. (This is also called the difference between $\Sigma^*$ and $L$.)
The intersection of two languages $L_1$ and $L_2$, denoted $L_1 \cap L_2$, is the set of strings that are
in both languages.

THEOREM 4.6.1 The class of regular languages is closed under the following operations:

•  concatenation 

•  union 

•  Kleene  closure 

•  complementation 

•  intersection 

Proof In Section 4.4 we showed that the languages denoted by regular expressions are
exactly the languages recognized by finite-state machines (deterministic or nondeterministic).
Since regular expressions are defined in terms of concatenation, union, and Kleene closure,
regular languages are closed under each of these operations.

The proof of closure of regular languages under complementation is straightforward. If
$L$ is regular, it is recognized by some DFSM $M$. Make all final states of $M$ non-final
and all non-final states final. Because $M$ is deterministic, this new machine recognizes
exactly the complement of $L$. Thus, $\overline{L}$ is also regular.

The proof of closure of regular languages under intersection follows by noting that if $L_1$
and $L_2$ are regular languages, then

$$L_1 \cap L_2 = \overline{\overline{L_1} \cup \overline{L_2}}$$

that is, the intersection of two sets can be obtained by complementing the union of their
complements. Since each of $\overline{L_1}$ and $\overline{L_2}$ is regular, as is their union, it follows that $\overline{L_1} \cup \overline{L_2}$
is regular. (See Fig. 4.19(a).) Finally, the complement of a regular set is regular. ■
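The complementation and intersection arguments can be sketched directly on DFSMs. In the Python illustration below a machine is a tuple (states, alphabet, delta, start, finals) with a transition defined for every state and letter; this representation and all names are assumptions of the sketch. Note one substitution: the union is formed with a product machine that runs both DFSMs in parallel, rather than through regular expressions or an NFSM as in Section 4.4, because the product machine is again deterministic and can be complemented by swapping final states.

```python
from itertools import product

def complement(M):
    """Swap final and non-final states of a complete DFSM."""
    states, alphabet, delta, start, finals = M
    return (states, alphabet, delta, start, states - finals)

def union(M1, M2):
    """Product machine that accepts when at least one component accepts."""
    states1, alphabet, delta1, start1, finals1 = M1
    states2, _, delta2, start2, finals2 = M2
    states = set(product(states1, states2))
    delta = {((p, q), a): (delta1[(p, a)], delta2[(q, a)])
             for (p, q) in states for a in alphabet}
    finals = {(p, q) for (p, q) in states if p in finals1 or q in finals2}
    return (states, alphabet, delta, (start1, start2), finals)

def intersection(M1, M2):
    """De Morgan: the complement of the union of the complements."""
    return complement(union(complement(M1), complement(M2)))

def accepts(M, w):
    """Run the DFSM on w and report whether it ends in a final state."""
    _, _, delta, state, finals = M
    for a in w:
        state = delta[(state, a)]
    return state in finals

# Two small example machines over {0, 1}.
EVEN_ZEROS = ({"e", "o"}, "01",
              {("e", "0"): "o", ("o", "0"): "e", ("e", "1"): "e", ("o", "1"): "o"},
              "e", {"e"})
ENDS_IN_ONE = ({"n", "y"}, "01",
               {("n", "0"): "n", ("y", "0"): "n", ("n", "1"): "y", ("y", "1"): "y"},
               "n", {"y"})
BOTH = intersection(EVEN_ZEROS, ENDS_IN_ONE)
```

`BOTH` should accept exactly the strings that have an even number of 0's and end in 1.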

Figure 4.19 (a) The intersection $L_1 \cap L_2$ of two sets $L_1$ and $L_2$ can be obtained by taking the
complement $\overline{\overline{L_1} \cup \overline{L_2}}$ of the union $\overline{L_1} \cup \overline{L_2}$ of their complements. (b) If $L(M_1) \subseteq L(M_2)$, then
$L(M_1) \cap \overline{L(M_2)} = \emptyset$.

When we come to study Turing machines in Chapter 5, we will show that there are
well-defined languages that have no machine to recognize them, even if the machine has an infinite
amount of storage available. Thus, it is interesting to ask if there are algorithms that solve
certain decision problems about regular languages in a finite number of steps. (Machines that
halt on all input are said to implement algorithms.) As shown above, there are algorithms
that can recognize the concatenation, union and Kleene closure of regular languages. We now
show that algorithms exist for a number of decision problems concerning finite-state machines.

THEOREM  4.6.2  There  are  algorithms  for  each  of  the  following  decision  problems: 

a) For a finite-state machine $M$ and a string $w$, determine if $w \in L(M)$.

b) For a finite-state machine $M$, determine if $L(M) = \emptyset$.

c) For a finite-state machine $M$, determine if $L(M) = \Sigma^*$.

d) For finite-state machines $M_1$ and $M_2$, determine if $L(M_1) \subseteq L(M_2)$.

e) For finite-state machines $M_1$ and $M_2$, determine if $L(M_1) = L(M_2)$.

Proof To answer (a) it suffices to supply $w$ to a deterministic finite-state machine
equivalent to $M$ and observe the final state after it has processed all letters in $w$. The number of
steps executed by this machine is the length of $w$. Question (b) is answered in Lemma 4.5.2.
We need only determine if the language contains strings of length less than $m$, where $m$ is
the number of states of $M$. This can be done by trying all inputs of length less than $m$.

The answer to question (c) is the same as the answer to “Is $\overline{L(M)} = \emptyset$?” The answer to
question (d) is the same as the answer to “Is $L(M_1) \cap \overline{L(M_2)} = \emptyset$?” (See Fig. 4.19(b).)
Since FSMs that recognize the complement and intersection of regular languages can be
constructed in a finite number of steps (see the proof of Theorem 4.6.1), we can use the
procedure for (b) to answer the question. Finally, the answer to question (e) is “yes” if and
only if $L(M_1) \subseteq L(M_2)$ and $L(M_2) \subseteq L(M_1)$. ■
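Questions (c), (d), and (e) can be sketched in Python as follows. The sketch makes two assumptions beyond the text: each machine is a complete DFSM given as a tuple (delta, start, finals), and emptiness of $L(M_1) \cap \overline{L(M_2)}$ is decided by searching the reachable pairs of states rather than by trying all inputs of length less than $m$. The two methods answer the same question; the search simply avoids enumerating strings. All names are hypothetical.

```python
from collections import deque

def contained_in(M1, M2, alphabet):
    """Decide L(M1) subset of L(M2) by searching reachable state pairs."""
    delta1, start1, finals1 = M1
    delta2, start2, finals2 = M2
    seen = {(start1, start2)}
    queue = deque(seen)
    while queue:
        p, q = queue.popleft()
        if p in finals1 and q not in finals2:
            return False      # some string is in L(M1) but not in L(M2)
        for a in alphabet:
            pair = (delta1[(p, a)], delta2[(q, a)])
            if pair not in seen:
                seen.add(pair)
                queue.append(pair)
    return True

def equivalent(M1, M2, alphabet):
    """Question (e): equality is containment in both directions."""
    return contained_in(M1, M2, alphabet) and contained_in(M2, M1, alphabet)

def is_universal(M, alphabet):
    """Question (c): compare with a one-state machine accepting every string."""
    everything = ({("s", a): "s" for a in alphabet}, "s", {"s"})
    return contained_in(everything, M, alphabet)

# Example machines over {0, 1}.
EVEN_ZEROS = ({("e", "0"): "o", ("o", "0"): "e", ("e", "1"): "e", ("o", "1"): "o"},
              "e", {"e"})
ENDS_IN_ONE = ({("n", "0"): "n", ("y", "0"): "n", ("n", "1"): "y", ("y", "1"): "y"},
               "n", {"y"})
# Counts 0's modulo 4 and accepts on 0 or 2: same language as EVEN_ZEROS.
MOD_FOUR = ({**{(i, "0"): (i + 1) % 4 for i in range(4)},
             **{(i, "1"): i for i in range(4)}}, 0, {0, 2})
```

`MOD_FOUR` has more states than `EVEN_ZEROS` but recognizes the same language, which is the situation that state minimization addresses.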

4.7  State  Minimization* 

Given a finite-state machine $M$, it is often useful to have a potentially different DFSM $M_{\min}$
with the smallest number of states (a minimal-state machine) that recognizes the same language
$L(M)$. In this section we develop a procedure to find such a machine recognizing a regular
language $L$. As a step in this direction, we define a natural equivalence relation $R_L$ for each
language $L$ and show that $L$ is regular if and only if $R_L$ has a finite number of equivalence classes.

4.7.1 Equivalence Relations on Languages and States

The relation $R_L$ is used to define a machine $M_L$. When $L$ is regular, we show that $M_L$ is a
minimal-state DFSM. We also give an explicit procedure to construct a minimal-state DFSM
recognizing a regular language $L$. The approach is the following: a) given a regular expression,
an NFSM is constructed (Theorem 4.4.1); b) an equivalent DFSM is then produced
(Theorem 4.2.1); c) equivalent states of this DFSM are discovered and coalesced, thereby producing
the minimal machine. We begin our treatment with a discussion of equivalence relations.

DEFINITION  4.7.1  An  equivalence  relation  R  on  a  set  A  is  a  partition  of  the  elements  of  A  into 
disjoint  subsets  called  equivalence  classes.  If  two  elements  a  and  b  are  in  the  same  equivalence 
class  under  relation  R,  we  write  aRb.  If  a  is  an  element  of  an  equivalence  class,  we  represent  its 
equivalence  class  by  [a].  An  equivalence  relation  is  represented  by  its  equivalence  classes. 

An example of an equivalence relation on the set $A = \{0, 1, 2, 3\}$ is the set of equivalence
classes $\{\{0, 2\}, \{1, 3\}\}$. Then, $[0]$ and $[2]$ denote the same equivalence class, namely $\{0, 2\}$,
whereas $[1]$ and $[2]$ denote different equivalence classes.

Equivalence relations can be defined on any set, including the set of strings over a finite
alphabet (a language). For example, let the partition $\{0^*,\ 0(0^*10^*)^+,\ 1(0 + 1)^*\}$ of the
set $(0 + 1)^*$ denote the equivalence relation $R$. The equivalence classes consist of strings
consisting of zero or more 0's, strings starting with 0 and containing at least one 1, and strings
beginning with 1. It follows that $00\,R\,000$ and $1001\,R\,11$ but not that $10\,R\,01$.

Additional conditions can be put on equivalence relations on languages. An important
restriction is that an equivalence relation be right-invariant (with respect to concatenation).

DEFINITION 4.7.2 An equivalence relation $R$ over the alphabet $\Sigma$ is right-invariant (with respect
to concatenation) if for all $u$ and $v$ in $\Sigma^*$, $uRv$ implies $uzRvz$ for all $z \in \Sigma^*$.

For example, let $R = \{(10^*1 + 0)^*,\ 0^*1(10^*1 + 0)^*\}$. That is, $R$ consists of two
equivalence classes, the set containing strings with an even number of 1's and the set containing
strings with an odd number of 1's. $R$ is right-invariant because if $uRv$, that is, if the numbers
of 1's in $u$ and $v$ are both even or both odd, then the same is true of $uz$ and $vz$ for each
$z \in \Sigma^*$, that is, $uzRvz$.

To each language $L$, whether regular or not, we associate the natural equivalence relation
$R_L$ defined below. Problem 4.30 shows that for some languages $R_L$ has an unbounded number
of equivalence classes.

DEFINITION 4.7.3 Given a language $L$ over $\Sigma$, the equivalence relation $R_L$ is defined as follows:
strings $u, v \in \Sigma^*$ are equivalent, that is, $uR_Lv$, if and only if for each $z \in \Sigma^*$, either both $uz$
and $vz$ are in $L$ or both are not in $L$.

The equivalence relation $R = \{(10^*1 + 0)^*,\ 0^*1(10^*1 + 0)^*\}$ given above is the equivalence
relation $R_L$ for both the language $L = (10^*1 + 0)^*$ and the language $L = 0^*1(10^*1 + 0)^*$.
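The definition of $R_L$ quantifies over all suffixes $z$, so it cannot be checked exhaustively, but a bounded check conveys the idea. The Python sketch below tests the condition only for suffixes up to a chosen length (an assumption of this illustration, so a result of True is evidence rather than proof), using $L = (10^*1 + 0)^*$ written in the syntax of Python's `re` module. The names are hypothetical.

```python
import re
from itertools import product

def in_L(w):
    """Membership in L = (10*1 + 0)*: an even number of 1's."""
    return re.fullmatch(r"(?:10*1|0)*", w) is not None

def related(u, v, max_suffix=6):
    """Bounded check of u R_L v: compare uz and vz for all |z| <= max_suffix."""
    return all(
        in_L(u + "".join(z)) == in_L(v + "".join(z))
        for n in range(max_suffix + 1)
        for z in product("01", repeat=n)
    )
```

Strings with the same parity of 1's, such as 11 and 0, pass the check, while 1 and 0 are already separated by the empty suffix.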

A natural right-invariant equivalence relation on strings can also be associated with each
DFSM, as shown below. This relation defines two strings as equivalent if they carry the
machine from its initial state to the same state. Thus, for each state there is an equivalence class
of strings that take the machine to that state. For this purpose we extend the state transition
function $\delta$ to strings $\sigma \in \Sigma^*$ recursively by $\delta(q, \epsilon) = q$ and $\delta(q, \sigma a) = \delta(\delta(q, \sigma), a)$ for
$a \in \Sigma$.

DEFINITION 4.7.4 Given a DFSM $M = (\Sigma, Q, \delta, s, F)$, $R_M$ is the equivalence relation defined
as follows: for all $u, v \in \Sigma^*$, $uR_Mv$ if and only if $\delta(s, u) = \delta(s, v)$. (Note that $\delta(q, \epsilon) = q$.)

It is straightforward to show that the equivalence relations $R_L$ and $R_M$ are right-invariant.
(See Problems 4.28 and 4.29.) It is also clear that $R_M$ has as many equivalence classes as there
are accessible states of $M$.

Before we present the major results of this section we define a special machine $M_L$ that
will be seen to be a minimal machine recognizing the language $L$.

DEFINITION 4.7.5 Given the language $L$ over the alphabet $\Sigma$ with finite $R_L$, the DFSM $M_L =
(\Sigma, Q_L, \delta_L, s_L, F_L)$ is defined in terms of the right-invariant equivalence relation $R_L$ as follows:
a) the states $Q_L$ are the equivalence classes of $R_L$; b) the initial state $s_L$ is the equivalence class
$[\epsilon]$; c) the final states $F_L$ are the equivalence classes containing strings in the language $L$; d) for an
arbitrary equivalence class $[u]$ with representative element $u \in \Sigma^*$ and an arbitrary input letter
$a \in \Sigma$, the next-state transition function $\delta_L : Q_L \times \Sigma \mapsto Q_L$ is defined by $\delta_L([u], a) = [ua]$.

For this definition to make sense we must show that condition c) does not contradict the
facts about $R_L$: that an equivalence class containing a string in $L$ does not also contain a
string that is not in $L$. But by the definition of $R_L$, if we choose $z = \epsilon$, we have that $uR_Lv$
only if either both $u$ and $v$ are in $L$ or neither is. We must also show that the next-state
function definition is consistent: it should not matter which representative of the equivalence
class $[u]$ is used. In particular, if we denote the class $[u]$ by $[v]$ for $v$ another member of the
class, it should follow that $[ua] = [va]$. But this is a consequence of the definition of $R_L$.

Figure 4.20 shows the machine $M_L$ associated with $L = (10^*1 + 0)^*$. The initial state
is associated with $[\epsilon]$, which is in the language. Thus, the initial state is also a final state. The
state associated with $[0]$ is also $[\epsilon]$ because $\epsilon$ and 0 both contain an even number of 1's and
are therefore equivalent under $R_L$. Thus, the transition from state $[\epsilon]$ on input 0 is back to
state $[\epsilon]$. Problem 4.31 asks the reader to complete the description of this machine.

We need the notion of a refinement of an equivalence relation before we establish
conditions for a language to be regular.

DEFINITION 4.7.6 An equivalence relation $R$ over a set $A$ is a refinement of an equivalence
relation $S$ over the same set if $aRb$ implies that $aSb$. A refinement $R$ of $S$ is strict if there exist
$a, b \in A$ such that $aSb$ but it is not true that $aRb$.

Over the set $A = \{a, b, c, d\}$, the relation $R = \{\{a\}, \{b\}, \{c, d\}\}$ is a strict refinement
of the relation $S = \{\{a, b\}, \{c, d\}\}$. Clearly, if $R$ is a refinement of $S$, $R$ has no fewer
equivalence classes than does $S$. If the refinement $R$ of $S$ is strict, $R$ has more equivalence
classes than does $S$.

Figure 4.20 The machine $M_L$ associated with $L = (10^*1 + 0)^*$.

4.7.2 The Myhill-Nerode Theorem

The following theorem uses the notion of refinement to give conditions under which a
language is regular.

THEOREM 4.7.1 (Myhill-Nerode) $L$ is a regular language if and only if $R_L$ has a finite
number of equivalence classes. Furthermore, if $L$ is regular, it is the union of some of the equivalence
classes of $R_L$.

Proof We begin by showing that if $L$ is regular, $R_L$ has a finite number of equivalence
classes. Let $L$ be recognized by the DFSM $M = (\Sigma, Q, \delta, s, F)$. Then the number of
equivalence classes of $R_M$ is finite. Consider two strings $u, v \in \Sigma^*$ that are equivalent
under $R_M$. By definition, $u$ and $v$ carry $M$ from its initial state to the same state, whether
final or not. Thus, $uz$ and $vz$ also carry $M$ to the same state. It follows that $R_M$ is
right-invariant. Because $uR_Mv$, either $u$ and $v$ take $M$ to a final state and are in $L$ or they take
$M$ to a non-final state and are not in $L$. It follows from the definition of $R_L$ that $uR_Lv$.
Thus, $R_M$ is a refinement of $R_L$. Consequently, $R_L$ has no more equivalence classes than
does $R_M$ and this number is finite.

Now let $R_L$ have a finite number of equivalence classes. We show that the machine
$M_L$ recognizes $L$. Since it has a finite number of states, we are done. The proof that $M_L$
recognizes $L$ is straightforward. If $[w]$ is a final state, it is reached by applying to $M_L$ in
its initial state a string in $[w]$. Since the final states are the equivalence classes containing
exactly those strings that are in $L$, $M_L$ recognizes $L$. It follows that if $L$ is regular, it is the
union of some of the equivalence classes of $R_L$. ■

We  now  state  an  important  corollary  of  this  theorem  that  identifies  a  minimal  machine 
recognizing  a  regular  language  L.  Two  DFSMs  are  isomorphic  if  they  differ  only  in  the  names 
given  to  states. 

COROLLARY 4.7.1 If $L$ is regular, the machine $M_L$ is a minimal DFSM recognizing $L$. All other
such minimal machines are isomorphic to $M_L$.

Proof From the proof of Theorem 4.7.1, if $M$ is any DFSM recognizing $L$, it has no fewer
states than there are equivalence classes of $R_L$, which is the number of states of $M_L$. Thus,
$M_L$ has a minimal number of states.

Consider another minimal machine $M_0 = (\Sigma, Q_0, \delta_0, s_0, F_0)$. Each state of $M_0$ can
be identified with some state of $M_L$. Equate the initial states of $M_L$ and $M_0$ and let $q$ be
an arbitrary state of $M_0$. There is some string $u \in \Sigma^*$ such that $q = \delta_0(s_0, u)$. (If not,
$M_0$ is not minimal.) Equate state $q$ with state $\delta_L(s_L, u) = [u]$ of $M_L$. Let $v \in [u]$.
If $\delta_0(s_0, v) \neq q$, $M_0$ has more states than does $M_L$, which is a contradiction. Thus, the
identification of states in these two machines is consistent. The final states $F_0$ of $M_0$ are
identified with those equivalence classes of $M_L$ that contain strings in $L$.

Consider now the next-state function $\delta_0$ of $M_0$. Let state $q$ of $M_0$ be identified with
state $[u]$ of $M_L$ and let $a$ be an input letter. Then, if $\delta_0(q, a) = p$, it follows that $p$ is
associated with state $[ua]$ of $M_L$ because the input string $ua$ maps $s_0$ to state $p$ in $M_0$ and
maps $s_L$ to $[ua]$ in $M_L$. Thus, the next-state functions of the two machines are identical
up to a renaming of the states of the two machines. ■
4.7.3 A State Minimization Algorithm

The above approach does not offer a direct way to find a minimal-state machine. In this
section we give a procedure for this purpose. Given a regular language, we construct an NFSM
that recognizes it (Theorem 4.4.1) and then convert the NFSM to an equivalent DFSM
(Theorem 4.2.1). Once we have such a DFSM $M$, we give a procedure to minimize the number of
states based on combining equivalence classes of the right-invariant equivalence relation $R_M$
that are indistinguishable. (These equivalence classes are sets of states of $M$.) The resulting
machine is isomorphic to $M_L$, the minimal-state machine.

DEFINITION 4.7.7 Let $M = (\Sigma, Q, \delta, s, F)$ be a DFSM. The equivalence relation $\equiv_n$ on states
in $Q$ is defined as follows: two states $p$ and $q$ of $M$ are n-indistinguishable (denoted $p \equiv_n q$) if
and only if for all input strings $u \in \Sigma^*$ of length $|u| \le n$ either both $\delta(p, u)$ and $\delta(q, u)$ are in
$F$ or both are not in $F$. (We write $p \not\equiv_n q$ if $p$ and $q$ are not n-indistinguishable.) Two states $p$
and $q$ are equivalent (denoted $p \equiv q$) if they are n-indistinguishable for all $n \ge 0$.

For arbitrary states $q_1$, $q_2$, and $q_3$, if $q_1$ and $q_2$ are n-indistinguishable and $q_2$ and $q_3$ are
n-indistinguishable, then $q_1$ and $q_3$ are n-indistinguishable. Thus, all three states are in the
same set of the partition and $\equiv_n$ is an equivalence relation. By an extension of this type of
reasoning to all values of $n$, it is also clear that $\equiv$ is an equivalence relation.

The following lemma establishes that $\equiv_{j+1}$ refines $\equiv_j$ and that for some $k$ and all $j \ge k$,
$\equiv_j$ is identical to $\equiv_k$, which is in turn equal to $\equiv$.

LEMMA 4.7.1 Let $M = (\Sigma, Q, \delta, s, F)$ be an arbitrary DFSM. Over the set $Q$ the equivalence
relation $\equiv_{n+1}$ is a refinement of the relation $\equiv_n$. Furthermore, if for some $k \le |Q| - 2$, $\equiv_{k+1}$
and $\equiv_k$ are equal, then so are $\equiv_{j+1}$ and $\equiv_j$ for all $j \ge k$. In particular, $\equiv_k$ and $\equiv$ are identical.

Proof If $p \equiv_{n+1} q$ then $p \equiv_n q$ by definition. Thus, for $n \ge 0$, $\equiv_{n+1}$ refines $\equiv_n$.

We now show that if $\equiv_{k+1}$ and $\equiv_k$ are equal, then $\equiv_{j+1}$ and $\equiv_j$ are equal for all $j \ge k$.
Suppose not. Let $l$ be the smallest value of $j$ for which $\equiv_{j+1}$ and $\equiv_j$ are equal but $\equiv_{j+2}$ and
$\equiv_{j+1}$ are not equal. It follows that there exist two states $p$ and $q$ that are indistinguishable
for input strings of length $l + 1$ or less but are distinguishable for some input string $v$ of
length $|v| = l + 2$. Let $v = au$ where $a \in \Sigma$ and $|u| = l + 1$. Since $\delta(p, v) = \delta(\delta(p, a), u)$
and $\delta(q, v) = \delta(\delta(q, a), u)$, it follows that the states $\delta(p, a)$ and $\delta(q, a)$ are distinguishable
by some string $u$ of length $l + 1$ but not by any string of length $l$ or less. But this contradicts the
assumption that $\equiv_{l+1}$ and $\equiv_l$ are equal.

The relation $\equiv_0$ has two equivalence classes, the final states and all other states. For each
integer $j \le k$, where $k$ is the smallest integer such that $\equiv_{k+1}$ and $\equiv_k$ are equal, $\equiv_j$ has at
least one more equivalence class than does $\equiv_{j-1}$. That is, it has at least $j + 2$ classes. Since
$\equiv_k$ can have at most $|Q|$ equivalence classes, it follows that $k + 2 \le |Q|$.

Clearly, $\equiv_k$ and $\equiv$ are identical because if two states cannot be distinguished by input
strings of length $k$ or less, they cannot be distinguished by input strings of any length. ■

The proof of this lemma provides an algorithm to compute the equivalence relation $\equiv$, namely, compute the relations $\equiv_j$, $0 \le j \le |Q| - 2$, in succession until we find two relations that are identical. We find $\equiv_{j+1}$ from $\equiv_j$ as follows: for every pair of states $(p, q)$ in an equivalence class of $\equiv_j$, we find their successor states $\delta(p, a)$ and $\delta(q, a)$ under input letter $a$ for each such letter. If for all letters $a$, $\delta(p, a) \equiv_j \delta(q, a)$ and $p \equiv_j q$, then $p \equiv_{j+1} q$ because we cannot distinguish between $p$ and $q$ on inputs of length $j + 1$ or less. Thus, the algorithm compares each pair of states in an equivalence class of $\equiv_j$ and forms equivalence classes of $\equiv_{j+1}$ by grouping together states whose successors under input letters are in the same equivalence class of $\equiv_j$.

To illustrate these ideas, consider the DFSM of Fig. 4.14. The equivalence classes of $\equiv_0$ are $\{\{s_0, q_R\}, \{q_1, q_2, q_3\}\}$. Since $\delta(s_0, 0)$ and $\delta(q_R, 0)$ are different, $s_0$ and $q_R$ are in different equivalence classes of $\equiv_1$. Also, because $\delta(q_3, 0) = q_R$ and $\delta(q_1, 0) = \delta(q_2, 0) = q_1 \in F$, $q_3$ is in a different equivalence class of $\equiv_1$ from $q_1$ and $q_2$. The latter two states are in the same equivalence class because $\delta(q_1, 1) = \delta(q_2, 1) = q_R \notin F$. Thus, $\equiv_1 = \{\{s_0\}, \{q_R\}, \{q_3\}, \{q_1, q_2\}\}$. The only one of these equivalence classes that could be refined is the last one. However, since we cannot distinguish between the two states in this class under any input, no further refinement is possible and $\equiv\ =\ \equiv_1$.
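
The refinement procedure can be sketched in Python as follows. This is a minimal illustration, not the text's own formulation: the function name `equivalence_classes` and the dictionary encoding of the next-state function are assumptions, and the five-state machine is chosen only to agree with the transitions quoted above for Fig. 4.14 (the transitions of $s_0$ on both letters and of $q_3$ and $q_R$ on input 1 are assumed, since the figure is not reproduced here).

```python
def equivalence_classes(states, alphabet, delta, finals):
    """Partition `states` into classes of mutually indistinguishable states."""
    # Relation 0 separates final states from all other states.
    block = {q: int(q in finals) for q in states}
    while True:
        # Two states stay together only if they share a block and, for every
        # letter, their successors share a block of the current relation.
        signature = {
            q: (block[q],) + tuple(block[delta[q, a]] for a in alphabet)
            for q in states
        }
        ids = {sig: i for i, sig in enumerate(sorted(set(signature.values())))}
        refined = {q: ids[signature[q]] for q in states}
        if len(ids) == len(set(block.values())):
            break  # no block was split, so the relation is stable
        block = refined
    classes = {}
    for q in states:
        classes.setdefault(block[q], set()).add(q)
    return {frozenset(c) for c in classes.values()}


# Assumed encoding of a DFSM consistent with the transitions quoted in the text.
STATES = ["s0", "q1", "q2", "q3", "qR"]
ALPHABET = ["0", "1"]
FINALS = {"q1", "q2", "q3"}
DELTA = {
    ("s0", "0"): "q3", ("s0", "1"): "q2",
    ("q1", "0"): "q1", ("q1", "1"): "qR",
    ("q2", "0"): "q1", ("q2", "1"): "qR",
    ("q3", "0"): "qR", ("q3", "1"): "qR",
    ("qR", "0"): "qR", ("qR", "1"): "qR",
}
CLASSES = equivalence_classes(STATES, ALPHABET, DELTA, FINALS)
```

Renumbering the blocks after every round keeps each signature a short tuple of integers; the loop stops as soon as a round splits no block, which mirrors the stopping rule of Lemma 4.7.1.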

We  now  show  that  if  two  states  are  equivalent  under  =,  they  can  be  combined,  but  if  they 
are  distinguishable  under  =,  they  cannot.  Applying  this  procedure  provides  a  minimal-state 
DFSM. 

DEFINITION 4.7.8 Let $M = (\Sigma, Q, \delta, s, F)$ be a DFSM and let $\equiv$ be the equivalence relation defined above over $Q$. The DFSM $M_\equiv = (\Sigma, Q_\equiv, \delta_\equiv, [s], F_\equiv)$ associated with the relation $\equiv$ is defined as follows: a) the states $Q_\equiv$ are the equivalence classes of $\equiv$; b) the initial state of $M_\equiv$ is $[s]$; c) the final states $F_\equiv$ are the equivalence classes containing states in $F$; d) for an arbitrary equivalence class $[q]$ with representative element $q \in Q$ and an arbitrary input letter $a \in \Sigma$, the next-state function $\delta_\equiv : Q_\equiv \times \Sigma \mapsto Q_\equiv$ is defined by $\delta_\equiv([q], a) = [\delta(q, a)]$.

This definition is consistent; no matter which representative of the equivalence class $[q]$ is used, the next state on input $a$ is $[\delta(q, a)]$. It is straightforward to show that $M_\equiv$ recognizes the same language as does $M$. (See Problem 4.27.) We now show that $M_\equiv$ is a minimal-state machine.
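
As a hedged sketch of Definition 4.7.8, the Python below builds the quotient machine from a given set of equivalence classes and checks on short inputs that it agrees with the original machine. The helper names `quotient_dfsm` and `accepts`, the dictionary encoding of $\delta$, and the sample five-state DFSM with its classes are illustrative assumptions rather than part of the text.

```python
from itertools import product


def quotient_dfsm(classes, alphabet, delta, start, finals):
    """Build the DFSM whose states are the given equivalence classes."""
    class_of = {q: c for c in classes for q in c}
    q_delta = {
        # Any representative works: equivalent states have equivalent successors.
        (c, a): class_of[delta[next(iter(c)), a]]
        for c in classes for a in alphabet
    }
    q_finals = {c for c in classes if c & finals}
    return q_delta, class_of[start], q_finals


def accepts(delta, start, finals, word):
    """Run a DFSM on `word` and report whether it ends in a final state."""
    state = start
    for a in word:
        state = delta[state, a]
    return state in finals


# Assumed sample machine and its equivalence classes.
ALPHABET = ["0", "1"]
FINALS = {"q1", "q2", "q3"}
DELTA = {
    ("s0", "0"): "q3", ("s0", "1"): "q2",
    ("q1", "0"): "q1", ("q1", "1"): "qR",
    ("q2", "0"): "q1", ("q2", "1"): "qR",
    ("q3", "0"): "qR", ("q3", "1"): "qR",
    ("qR", "0"): "qR", ("qR", "1"): "qR",
}
CLASSES = {frozenset({"s0"}), frozenset({"qR"}),
           frozenset({"q3"}), frozenset({"q1", "q2"})}

Q_DELTA, Q_START, Q_FINALS = quotient_dfsm(CLASSES, ALPHABET, DELTA, "s0", FINALS)
WORDS = ["".join(w) for n in range(6) for w in product(ALPHABET, repeat=n)]
AGREE = all(
    accepts(DELTA, "s0", FINALS, w) == accepts(Q_DELTA, Q_START, Q_FINALS, w)
    for w in WORDS
)
```

The check over all strings of length at most five is only a sanity test, not a proof that the two machines recognize the same language.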

THEOREM 4.7.2 $M_\equiv$ is a minimal-state machine.

Proof Let $M = (\Sigma, Q, \delta, s, F)$ be a DFSM recognizing $L$ and let $M_\equiv$ be the DFSM associated with the equivalence relation $\equiv$ on $Q$. Without loss of generality, we assume that all states of $M_\equiv$ are accessible from the initial state. We now show that $M_\equiv$ has no more states than $M_L$. Suppose it has more states. That is, suppose $M_\equiv$ has more states than there are equivalence classes of $R_L$. Then, there must be two states $p$ and $q$ of $M$ such that $[p] \ne [q]$ but that $u R_L v$, where $u$ and $v$ carry $M$ from its initial state to $p$ and $q$, respectively. (If this were not the case, any strings equivalent under $R_L$ would carry $M$ from its initial state $s$ to equivalent states, contradicting the assumption that $M_\equiv$ has more states than $M_L$.) But if $u R_L v$, then since $R_L$ is right-invariant, $uw R_L vw$ for all $w \in \Sigma^*$. However, because $[p] \ne [q]$, there is some $z \in \Sigma^*$ such that $[p]$ and $[q]$ can be distinguished. This is equivalent to saying that $uz R_L vz$ does not hold, a contradiction. Thus, $M_\equiv$ and $M_L$ have the same number of states. Since $M_\equiv$ recognizes $L$, it is a minimal-state machine equivalent to $M$. ■

As shown above, the equivalence relation $\equiv$ for the DFSM of Fig. 4.14 is $\{\{s_0\}, \{q_R\}, \{q_3\}, \{q_1, q_2\}\}$. The DFSM associated with this relation, $M_\equiv$, is shown in Fig. 4.21. It clearly recognizes the language $10^* + 0$. It follows that the equivalent DFSM of Fig. 4.14 is not minimal.

Figure 4.21 A minimal-state DFSM equivalent to the DFSM in Fig. 4.14.


4.8  Pushdown  Automata 

The  pushdown  automaton  (PDA)  has  a  one-way,  read-only,  potentially  infinite  input  tape  on 
which  an  input  string  is  written  (see  Fig.  4.22);  its  head  either  advances  to  the  right  from  the 
leftmost  cell  or  remains  stationary.  It  also  has  a  stack,  a  storage  medium  analogous  to  the  stack 
of  trays  in  a  cafeteria.  The  stack  is  a  potentially  infinite  ordered  collection  of  initially  blank 
cells  with  the  property  that  data  can  be  pushed  onto  it  or  popped  from  it.  Data  is  pushed  onto 
the  top  of  the  stack  by  moving  all  existing  entries  down  one  cell  and  inserting  the  new  element 
in  the  top  location.  Data  is  popped  by  removing  the  top  element  and  moving  all  other  entries 
up  one  cell.  The  control  unit  of  a  pushdown  automaton  is  a  finite-state  machine.  The  full 
power  of  the  PDA  is  realized  only  when  its  control  unit  is  nondeterministic. 

DEFINITION 4.8.1 A pushdown automaton (PDA) is a six-tuple $M = (\Sigma, \Gamma, Q, \Delta, s, F)$, where $\Sigma$ is the tape alphabet containing the blank symbol $\beta$, $\Gamma$ is the stack alphabet containing the blank symbol $\gamma$, $Q$ is the finite set of states, $\Delta \subseteq (Q \times (\Sigma \cup \{\epsilon\}) \times (\Gamma \cup \{\epsilon\}) \times Q \times (\Gamma \cup \{\epsilon\}))$ is the set of transitions, $s$ is the initial state, and $F$ is the set of final states. We now describe transitions.

If for state $p$, tape symbol $x$, and stack symbol $y$ the transition $(p, x, y; q, z) \in \Delta$, then if $M$ is in state $p$, $x \in \Sigma$ is under its tape head, and $y \in \Gamma$ is at the top of its stack, $M$ may pop $y$ from its stack, enter state $q \in Q$, and push $z \in \Gamma$ onto its stack. However, if $x = \epsilon$, $y = \epsilon$ or $z = \epsilon$, then $M$ does not read its tape, pop its stack or push onto its stack, respectively. The head on the tape either remains stationary if $x = \epsilon$ or advances one cell to the right if $x \ne \epsilon$.

If at each point in time a unique transition $(p, x, y; q, z)$ may be applied, the PDA is deterministic. Otherwise it is nondeterministic.

The PDA $M$ accepts the input string $w \in \Sigma^*$ if when started in state $s$ with an empty stack (its cells contain the blank stack symbol $\gamma$) and $w$ placed left-adjusted on its otherwise blank tape (its blank cells contain the blank tape symbol $\beta$), the last state entered by $M$ after reading the components of $w$ and no other tape cells is a member of the set $F$. $M$ accepts the language $L(M)$ consisting of all such strings.

Some of the special cases for the action of the PDA $M$ on empty tape or stack symbols are the following: if $(p, x, \epsilon; q, z)$, $x$ is read, state $q$ is entered, and $z$ is pushed onto the stack; if $(p, x, y; q, \epsilon)$, $x$ is read, state $q$ is entered, and $y$ is popped from the stack; if $(p, \epsilon, y; q, z)$, no input is read, $y$ is popped, $z$ is pushed and state $q$ is entered. Also, if $(p, \epsilon, \epsilon; q, \epsilon)$, $M$ moves from state $p$ to $q$ without reading input, or pushing or popping the stack.

Figure 4.22 The control unit, one-way input tape, and stack of a pushdown automaton.

Observe that if every transition is of the form $(p, x, \epsilon; q, \epsilon)$, the PDA ignores the stack and simulates an FSM. Thus, the languages accepted by PDAs include the regular languages.

We emphasize that a PDA is nondeterministic if for some state $q$, tape symbol $x$, and top stack item $y$ there is more than one transition that $M$ can make. For example, if $\Delta$ contains $(s, a, \epsilon; s, a)$ and $(s, a, a; r, \epsilon)$, $M$ has the choice of ignoring or popping the top of the stack and of moving to state $s$ or $r$. If after reading all symbols of $w$ $M$ enters a state in $F$, then $M$ accepts $w$.

We now give two examples of PDAs and the languages they accept. The first accepts palindromes of the form $\{wcw^R\}$, where $w^R$ is the reverse of $w$ and $w \in \{a, b\}^*$. The state diagram of its control unit is shown in Fig. 4.23. The second PDA accepts those strings over $\{a, b\}$ of the form $a^n b^m$ for which $n \ge m$.

EXAMPLE 4.8.1 The PDA $M = (\Sigma, \Gamma, Q, \Delta, s, F)$, where $\Sigma = \{a, b, c, \beta\}$, $\Gamma = \{a, b, \gamma\}$, $Q = \{s, p, r, f\}$, $F = \{f\}$ and $\Delta$ contains the transitions shown in Fig. 4.24, accepts the language $L = \{wcw^R\}$.

The PDA $M$ of Figs. 4.23 and 4.24 remains in the stacking state $s$ while encountering a's and b's on the input tape, pushing these letters onto the stack (Rules (a) and (b)); the order of these letters on the stack is the reverse of their order on the input tape. If it encounters an instance of letter c while in state $s$, it enters the possible accept state $p$ (Rule (c)) but enters the reject state $r$ if it encounters a blank on the input tape (Rule (d)). While in state $p$ it pops an a or b that matches the same letter on the input tape (Rules (e) and (f)). If the PDA discovers blank tape and stack symbols, it has identified a palindrome and enters the accept state $f$ (Rule (g)). On the other hand, if while in state $p$ the tape symbol and the symbol on the top of the stack are different or the letter c is encountered, the PDA enters the reject state $r$ (Rules (h)-(n)). Finally, the PDA does not exit from either the reject or accept states (Rules (o) and (p)).

Figure 4.23 State diagram for the pushdown automaton of Fig. 4.24, which accepts $\{wcw^R\}$. An edge label $a, b; c$ between states $p$ and $q$ corresponds to the transition $(p, a, b; q, c)$.


Rule   Transition                            Comment
(a)    $(s, a, \epsilon; s, a)$              push a
(b)    $(s, b, \epsilon; s, b)$              push b
(c)    $(s, c, \epsilon; p, \epsilon)$       accept?
(d)    $(s, \beta, \epsilon; r, \epsilon)$   reject
(e)    $(p, a, a; p, \epsilon)$              accept?
(f)    $(p, b, b; p, \epsilon)$              accept?
(g)    $(p, \beta, \gamma; f, \epsilon)$     accept
(h)    $(p, a, b; r, \epsilon)$              reject
(i)    $(p, b, a; r, \epsilon)$              reject
(j)    $(p, \beta, a; r, \epsilon)$          reject
(k)    $(p, \beta, b; r, \epsilon)$          reject
(l)    $(p, a, \gamma; r, \epsilon)$         reject
(m)    $(p, b, \gamma; r, \epsilon)$         reject
(n)    $(p, c, \epsilon; r, \epsilon)$       reject
(o)    $(r, \epsilon, \epsilon; r, \epsilon)$   stay in reject state
(p)    $(f, \epsilon, \epsilon; f, \epsilon)$   stay in accept state

Figure 4.24 Transitions for the PDA described by the state diagram of Fig. 4.23.
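
The behaviour of the rules in Fig. 4.24 can be explored with a small breadth-first simulator over PDA configurations. This is a sketch under stated assumptions, not the text's construction: `"_"` and `"$"` stand in for the blank symbols $\beta$ and $\gamma$, the empty Python string stands in for $\epsilon$, the tape is taken to be $w$ followed by a single blank, and a string is counted as accepted when some computation reaches a final state after all of $w$ has been read. The function name `pda_accepts` is hypothetical, and the search terminates only when the set of reachable configurations is finite, as it is for this machine.

```python
from collections import deque

EPS = ""     # stands for epsilon: no read, no pop, or no push
BETA = "_"   # assumed encoding of the blank tape symbol
GAMMA = "$"  # assumed encoding of the blank stack symbol


def pda_accepts(transitions, start, finals, word):
    """Search all computations of a (possibly nondeterministic) PDA on `word`."""
    tape = list(word) + [BETA]
    seen = set()
    queue = deque([(start, 0, ())])  # (state, head position, stack as a tuple)
    while queue:
        config = queue.popleft()
        if config in seen:
            continue
        seen.add(config)
        state, pos, stack = config
        if state in finals and pos >= len(word):
            return True
        top = stack[-1] if stack else GAMMA  # an empty stack shows the blank
        for p, x, y, q, z in transitions:
            if p != state:
                continue
            if x != EPS and (pos >= len(tape) or tape[pos] != x):
                continue
            if y != EPS and y != top:
                continue
            new_stack = stack[:-1] if (y != EPS and stack) else stack
            if z != EPS:
                new_stack = new_stack + (z,)
            new_pos = pos + 1 if x != EPS else pos
            queue.append((q, new_pos, new_stack))
    return False


# Rules (a)-(p) of Fig. 4.24 as (p, x, y, q, z) tuples.
WCWR = [
    ("s", "a", EPS, "s", "a"), ("s", "b", EPS, "s", "b"),
    ("s", "c", EPS, "p", EPS), ("s", BETA, EPS, "r", EPS),
    ("p", "a", "a", "p", EPS), ("p", "b", "b", "p", EPS),
    ("p", BETA, GAMMA, "f", EPS),
    ("p", "a", "b", "r", EPS), ("p", "b", "a", "r", EPS),
    ("p", BETA, "a", "r", EPS), ("p", BETA, "b", "r", EPS),
    ("p", "a", GAMMA, "r", EPS), ("p", "b", GAMMA, "r", EPS),
    ("p", "c", EPS, "r", EPS),
    ("r", EPS, EPS, "r", EPS), ("f", EPS, EPS, "f", EPS),
]
```

For instance, `pda_accepts(WCWR, "s", {"f"}, "abcba")` follows rules (a), (b), (c), (f), (e), and (g) in turn; the `seen` set is what prevents the self-loops of rules (o) and (p) from running forever.
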

Rule   Transition                            Comment
(a)    $(s, \beta, \epsilon; f, \epsilon)$   accept
(b)    $(s, a, \epsilon; s, a)$              push a
(c)    $(s, b, \gamma; r, \epsilon)$         reject
(d)    $(s, b, a; p, \epsilon)$              pop a, enter pop state
(e)    $(p, b, a; p, \epsilon)$              pop a
(f)    $(p, b, \gamma; r, \epsilon)$         reject
(g)    $(p, \beta, a; f, \epsilon)$          accept
(h)    $(p, \beta, \gamma; f, \epsilon)$     accept
(i)    $(p, a, \epsilon; r, \epsilon)$       reject
(j)    $(f, \epsilon, \epsilon; f, \epsilon)$   stay in accept state
(k)    $(r, \epsilon, \epsilon; r, \epsilon)$   stay in reject state

Figure 4.25 Transitions for a PDA that accepts the language $\{a^n b^m \mid n \ge m \ge 0\}$.


EXAMPLE 4.8.2 The PDA $M = (\Sigma, \Gamma, Q, \Delta, s, F)$, where $\Sigma = \{a, b, \beta\}$, $\Gamma = \{a, b, \gamma\}$, $Q = \{s, p, r, f\}$, $F = \{f\}$ and $\Delta$ contains the transitions shown in Fig. 4.25, accepts the language $L = \{a^n b^m \mid n \ge m \ge 0\}$. The state diagram for this machine is shown in Fig. 4.26.

The rules of Fig. 4.25 work as follows. An empty input in the stacking state $s$ is accepted (Rule (a)). If a string of a's is found, the PDA remains in state $s$ and the a's are pushed onto the stack (Rule (b)). At the first discovery of a b in the input while in state $s$, if the stack is empty, the input is rejected by entering the reject state (Rule (c)). If the stack is not empty, the a at the top is popped and the PDA enters the pop state $p$ (Rule (d)). If while in $p$ a b is discovered on the input tape when an a is found at the top of the stack (Rule (e)), the PDA pops the a and stays in this state because it remains possible that the input contains no more b's than a's. On the other hand, if the stack is empty when a b is discovered, the PDA enters the reject state (Rule (f)). If in state $p$ the PDA discovers that it has more a's than b's by reading the blank tape letter $\beta$ when the stack is not empty, it enters the accept state $f$ (Rule (g)). If the stack is empty when $\beta$ is read, the numbers of a's and b's are equal and the PDA also enters $f$ (Rule (h)). If the PDA encounters an a on its input tape when in state $p$, an a has been received after a b and the input is rejected (Rule (i)). After the PDA enters either the accept or reject states, it remains there (Rules (j) and (k)).

Figure 4.26 The state diagram for the PDA defined by the tables in Fig. 4.25.
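
Because at most one rule of Fig. 4.25 applies in any situation, its behaviour can also be traced with a plain loop. The Python below is a sketch under assumptions: the function name `accepts_anbm` is hypothetical, `"_"` stands in for the blank tape symbol $\beta$, the input is assumed to contain only the letters a and b, and rule (a) is read as accepting whenever a blank is seen in state $s$, whatever the stack holds.

```python
def accepts_anbm(word):
    """Follow the rules of Fig. 4.25 on a word over {a, b}."""
    state, stack = "s", []
    for symbol in list(word) + ["_"]:   # "_" plays the blank tape symbol
        if state == "s":
            if symbol == "_":           # rule (a): blank seen while stacking
                state = "f"
            elif symbol == "a":         # rule (b): push a
                stack.append("a")
            elif stack:                 # rule (d): b with an a on top of the stack
                stack.pop()
                state = "p"
            else:                       # rule (c): b with an empty stack
                state = "r"
        elif state == "p":
            if symbol == "_":           # rules (g) and (h): end of input
                state = "f"
            elif symbol == "a":         # rule (i): an a after a b
                state = "r"
            elif stack:                 # rule (e): pop a matching a
                stack.pop()
            else:                       # rule (f): more b's than a's
                state = "r"
        # States f and r are never left (rules (j) and (k)).
    return state == "f"
```

Tracing `accepts_anbm("aab")` gives two pushes, one pop into state $p$, and then acceptance on the blank with one a still on the stack, which is the situation covered by rule (g).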

In Section 4.12 we show that the languages recognized by pushdown automata are exactly the context-free languages described in the next section.


4.9  Formal  Languages 

Languages are introduced in Section 1.2.3. A language is a set of strings over a finite set $\Sigma$, with $|\Sigma| \ge 2$, called an alphabet. $\Sigma^*$ is the language of all strings over $\Sigma$ including the empty string $\epsilon$, which has zero length. The empty string has the property that for an arbitrary string $w$, $\epsilon w = w = w\epsilon$. $\Sigma^+$ is the set $\Sigma^*$ without the empty string.

In this section we introduce grammars for languages, rules for rewriting strings through the substitution of substrings. A grammar consists of alphabets $\mathcal{T}$ and $\mathcal{N}$ of terminal and non-terminal symbols, respectively, a designated non-terminal start symbol, plus a set of rules $\mathcal{R}$ for rewriting strings. Below we define four types of language in terms of their grammars: the phrase-structure, context-sensitive, context-free, and regular grammars.

The role of grammars is best illustrated with an example for a small fragment of English. Consider a grammar $G$ whose non-terminals $\mathcal{N}$ contain a start symbol S denoting a generic sentence and NP and VP denoting generic noun and verb phrases, respectively. In turn, assume that $\mathcal{N}$ also contains non-terminals for adjectives and adverbs, namely AJ and AV. Thus, $\mathcal{N}$ = {S, NP, VP, AJ, AV, N, V}. We allow the grammar to have the following words as terminals: $\mathcal{T}$ = {bob, alice, duck, big, smiles, quacks, loudly}. Here bob, alice, and duck are nouns, big is an adjective, smiles and quacks are verbs, and loudly is an adverb. In our fragment of English a sentence consists of a noun phrase followed by a verb phrase, which we denote by the rule S $\to$ NP VP. This and the other rules $\mathcal{R}$ of the grammar are shown below. They include rules to map non-terminals to terminals, such as N $\to$ bob.

S  $\to$ NP VP      N  $\to$ bob        V  $\to$ smiles
NP $\to$ N          N  $\to$ alice      V  $\to$ quacks
NP $\to$ AJ N       N  $\to$ duck       AV $\to$ loudly
VP $\to$ V          AJ $\to$ big
VP $\to$ V AV

With these rules the following strings (sentences) can be generated: bob smiles; big duck quacks loudly; and alice quacks. The first two sentences are acceptable English sentences, but the third is not if we interpret alice as a person. This example illustrates the need for rules that limit the rewriting of non-terminals to an appropriate context of surrounding symbols.

Grammars for formal languages generalize these ideas. Grammars are used to interpret programming languages. A language is translated and given meaning through a series of steps, the first of which is lexical analysis. In lexical analysis symbols such as a, l, i, c, e are grouped into tokens such as alice, or some other string denoting alice. This task is typically done with a finite-state machine. The second step in translation is parsing, a process in which a tokenized string is associated with a series of derivations or applications of the rules of a grammar. For example, big duck quacks loudly can be produced by the following sequence of derivations: S $\to$ NP VP; NP $\to$ AJ N; AJ $\to$ big; N $\to$ duck; VP $\to$ V AV; V $\to$ quacks; AV $\to$ loudly.
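
To make the English fragment concrete, the sketch below enumerates every sentence the grammar generates. The dictionary `RULES` and the function `expand` are illustrative names, not part of the text, and the recursive enumeration relies on the fact that this particular grammar has no recursive rules, so its language is finite.

```python
from itertools import product

# Each non-terminal maps to its list of right-hand sides.
RULES = {
    "S": [["NP", "VP"]],
    "NP": [["N"], ["AJ", "N"]],
    "VP": [["V"], ["V", "AV"]],
    "N": [["bob"], ["alice"], ["duck"]],
    "V": [["smiles"], ["quacks"]],
    "AJ": [["big"]],
    "AV": [["loudly"]],
}


def expand(symbol):
    """Yield every list of terminal words derivable from `symbol`."""
    if symbol not in RULES:          # a terminal derives only itself
        yield [symbol]
        return
    for rhs in RULES[symbol]:
        # Combine one expansion of each right-hand-side symbol, in order.
        for parts in product(*(list(expand(s)) for s in rhs)):
            yield [word for part in parts for word in part]


SENTENCES = {" ".join(words) for words in expand("S")}
```

The six noun phrases and four verb phrases combine into 24 sentences, among them "big duck quacks loudly" and the semantically odd "alice quacks".
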

In his exploration of models for natural language, Noam Chomsky introduced four language types of decreasing expressibility, now called the Chomsky hierarchy, in which each language is described by the type of grammar generating it. These languages serve as a basis for the classification of programming languages. The four types are the phrase-structure languages, the context-sensitive languages, the context-free languages, and the regular languages.

There is an exact correspondence between each of these types of languages and particular machine architectures in the sense that for each language type T there is a machine architecture A recognizing languages of type T and for each architecture A there is a type T such that all languages recognized by A are of type T. The correspondence between language and architecture is shown in the following table, which also lists the section or problem where the result is established. Here the linear bounded automaton is a Turing machine in which the number of tape cells that are used is linear in the length of the input string.


Level   Language Type       Machine Type                   Proof Location
0       phrase-structure    Turing machine                 Section 5.4
1       context-sensitive   linear bounded automaton       Problem 4.36
2       context-free        nondet. pushdown automaton     Section 4.12
3       regular             finite-state machine           Section 4.10

We  now  give  formal  definitions  of  each  of  the  grammar  types  under  consideration. 

4.9.1  Phrase-Structure  Languages 

In Section 5.4 we show that the languages defined by the phrase-structure grammars below are exactly the languages that can be recognized by Turing machines.

DEFINITION 4.9.1 A phrase-structure grammar $G$ is a four-tuple $G = (\mathcal{N}, \mathcal{T}, \mathcal{R}, S)$ where $\mathcal{N}$ and $\mathcal{T}$ are disjoint alphabets of non-terminals and terminals, respectively. Let $V = \mathcal{N} \cup \mathcal{T}$. The rules $\mathcal{R}$ form a finite subset of $V^+ \times V^*$ (denoted $\mathcal{R} \subseteq V^+ \times V^*$) where for every rule $(a, b) \in \mathcal{R}$, $a$ contains at least one non-terminal symbol. The symbol $S \in \mathcal{N}$ is the start symbol.

If $(a, b) \in \mathcal{R}$ we write $a \to b$. If $u \in V^+$ and $a$ is a contiguous substring of $u$, then $u$ can be replaced by the string $v$ by substituting $b$ for $a$. If this holds, we write $u \Rightarrow_G v$ and call it an immediate derivation. Extending this notation, if through a sequence of immediate derivations (called a derivation) $u \Rightarrow_G x_1, x_1 \Rightarrow_G x_2, \ldots, x_n \Rightarrow_G v$ we can transform $u$ to $v$, we write $u \stackrel{*}{\Rightarrow}_G v$ and say that $v$ derives from $u$. If the rules $\mathcal{R}$ contain $(a, a)$ for all $a \in \mathcal{N}^+$, the relation $\stackrel{*}{\Rightarrow}_G$ is called the transitive closure of the relation $\Rightarrow_G$ and $u \stackrel{*}{\Rightarrow}_G u$ for all $u \in V^*$ containing at least one non-terminal symbol.

The  language  L(G)  defined  by  the  grammar  G  is  the  set  of  all  terminal  strings  that  can  be 
derived  from  the  start  symbol  S;  that  is, 

$L(G) = \{u \in \mathcal{T}^* \mid S \stackrel{*}{\Rightarrow}_G u\}$

When the context is clear we drop the subscript $G$ in $\Rightarrow_G$ and $\stackrel{*}{\Rightarrow}_G$. These definitions are best understood from an example. In all our examples we use letters in SMALL CAPS to denote non-terminals and letters in italics to denote terminals, except that $\epsilon$, the empty letter, may also be a terminal.

EXAMPLE 4.9.1 Consider the grammar $G_1 = (\mathcal{N}_1, \mathcal{T}_1, \mathcal{R}_1, S)$, where $\mathcal{N}_1 = \{S, B, C\}$, $\mathcal{T}_1 = \{a, b, c\}$ and $\mathcal{R}_1$ consists of the following rules:

a) $S \to aSBC$      d) $aB \to ab$      g) $cC \to cc$
b) $S \to aBC$       e) $bB \to bb$
c) $CB \to BC$       f) $bC \to bc$

Clearly the string $aaBCBC$ can be rewritten as $aaBBCC$ using rule (c), that is, $aaBCBC \Rightarrow aaBBCC$. One application of (d), one of (e), one of (f), and one of (g) reduces it to the string $aabbcc$. Since one application of (a) and one of (b) produces the string $aaBCBC$, it follows that the language $L(G_1)$ contains $aabbcc$.

Similarly, two applications of (a) and one of (b) produce $aaaBCBCBC$, after which three applications of (c) produce the string $aaaBBBCCC$. One application of (d) and two of (e) produce $aaabbbCCC$, after which one application of (f) and two of (g) produce $aaabbbccc$. In general, one can show that $L(G_1) = \{a^n b^n c^n \mid n \ge 1\}$. (See Problem 4.38.)
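
The derivations above can be reproduced mechanically by a bounded search over sentential forms. The sketch below is illustrative only: the function name `derivable_terminal_strings` is hypothetical, upper-case letters encode non-terminals and lower-case letters encode terminals, and discarding forms longer than `max_len` is safe here only because no rule of $G_1$ shortens a string.

```python
from collections import deque

# Rules (a)-(g) of the grammar G1 as (left-hand side, right-hand side) pairs.
G1_RULES = [
    ("S", "aSBC"), ("S", "aBC"), ("CB", "BC"),
    ("aB", "ab"), ("bB", "bb"), ("bC", "bc"), ("cC", "cc"),
]


def derivable_terminal_strings(rules, start, max_len):
    """Return all terminal strings of length <= max_len derivable from `start`."""
    seen = {start}
    queue = deque([start])
    terminal = set()
    while queue:
        u = queue.popleft()
        if u.islower():              # no non-terminals remain
            terminal.add(u)
            continue
        for lhs, rhs in rules:
            i = u.find(lhs)
            while i != -1:           # rewrite every occurrence of lhs in turn
                v = u[:i] + rhs + u[i + len(lhs):]
                if len(v) <= max_len and v not in seen:
                    seen.add(v)
                    queue.append(v)
                i = u.find(lhs, i + 1)
    return terminal


SHORT_STRINGS = derivable_terminal_strings(G1_RULES, "S", 9)
```

With `max_len` set to 9 the search finds exactly the three strings with $n = 1, 2, 3$; sentential forms such as `aabcBC`, in which no rule applies to `cB`, are dead ends and contribute nothing.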

4.9.2  Context-Sensitive  Languages 

The  context-sensitive  languages  are  exactly  the  languages  accepted  by  linear  bounded  automata, 
nondeterministic  Turing  machines  whose  tape  heads  visit  a  number  of  cells  that  is  a  constant 
multiple  of  the  length  of  an  input  string.  (See  Problem  4.36.) 

DEFINITION 4.9.2 A context-sensitive grammar $G$ is a phrase-structure grammar $G = (\mathcal{N}, \mathcal{T}, \mathcal{R}, S)$ in which each rule $(a, b) \in \mathcal{R}$ satisfies the condition that $b$ has no fewer characters than does $a$, namely, $|a| \le |b|$. The languages defined by context-sensitive grammars are called context-sensitive languages (CSL).

Each  rule  of  a  context-sensitive  grammar  maps  a  string  to  one  that  is  no  shorter.  Since  the 
left-hand  side  of  a  rule  may  have  more  than  one  character,  it  may  make  replacements  based 
on  the  context  in  which  a  non-terminal  is  found.  Examples  of  context-sensitive  languages  are 
given  in  Problems  4.38  and  4.39. 

4.9.3  Context-Free  Languages 

As  shown  in  Section  4.12,  the  context-free  languages  are  exactly  the  languages  accepted  by 
pushdown  automata. 

DEFINITION 4.9.3 A context-free grammar $G = (\mathcal{N}, \mathcal{T}, \mathcal{R}, S)$ is a phrase-structure grammar in which each rule in $\mathcal{R} \subseteq \mathcal{N} \times V^*$ has a single non-terminal on the left-hand side. The languages defined by context-free grammars are called context-free languages (CFL).

Each  rule  of  a  context-free  grammar  maps  a  non-terminal  to  a  string  over  V*  without 
regard  to  the  context  in  which  the  non-terminal  is  found  because  the  left-hand  side  of  each 
rule  consists  of  a  single  non-terminal. 

EXAMPLE 4.9.2 Let $\mathcal{N}_2 = \{S, A\}$, $\mathcal{T}_2 = \{\epsilon, a, b\}$, and $\mathcal{R}_2 = \{S \to aSb, S \to \epsilon\}$. Then the grammar $G_2 = (\mathcal{N}_2, \mathcal{T}_2, \mathcal{R}_2, S)$ is context-free and generates the language $L(G_2) = \{a^n b^n \mid n \ge 0\}$. To see this, let the rule $S \to aSb$ be applied $k$ times to produce the string $a^k S b^k$. A final application of the last rule establishes the result.
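
The argument of Example 4.9.2 can be replayed step by step. In this small sketch the function name `derive_anbn` is an assumption, the upper-case letter `S` encodes the start symbol, and the empty Python string plays the role of $\epsilon$.

```python
def derive_anbn(k):
    """Apply S -> aSb k times, then S -> epsilon; return every sentential form."""
    forms = ["S"]
    for _ in range(k):
        # Each form contains exactly one S, so replace rewrites that occurrence.
        forms.append(forms[-1].replace("S", "aSb"))
    forms.append(forms[-1].replace("S", ""))
    return forms
```

For $k = 2$ the function returns the forms `S`, `aSb`, `aaSbb`, and `aabb`, matching the derivation described in the example.
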

EXAMPLE 4.9.3 Consider the grammar $G_3$ with the following rules and the implied terminal and non-terminal alphabets:

a) $S \to cMcNc$      d) $N \to bNb$
b) $M \to aMa$        e) $N \to c$
c) $M \to c$

$G_3$ is context-free and generates the language $L(G_3) = \{c a^n c a^n c b^m c b^m c \mid n, m \ge 0\}$, as is easily shown.

Context-free languages capture important aspects of many programming languages. As a consequence, the parsing of context-free languages is an important step in the parsing of programming languages. This topic is discussed in Section 4.11.

4.9.4  Regular  Languages 

DEFINITION 4.9.4 A regular grammar $G$ is a context-free grammar $G = (\mathcal{N}, \mathcal{T}, \mathcal{R}, S)$, where the right-hand side is either a terminal or a terminal followed by a non-terminal. That is, its rules are of the form $A \to a$ or $A \to bC$. The languages defined by regular grammars are called regular languages.

Some authors define a regular grammar to be one whose rules are of the form $A \to a$ or $A \to b_1 b_2 \cdots b_k C$. It is straightforward to show that any language generated by such a grammar can be generated by a grammar of the type defined above.

The  following  grammar  is  regular. 

EXAMPLE 4.9.4 Consider the grammar $G_4 = (\mathcal{N}_4, \mathcal{T}_4, \mathcal{R}_4, S)$ where $\mathcal{N}_4 = \{S, A, B\}$, $\mathcal{T}_4 = \{0, 1\}$ and $\mathcal{R}_4$ consists of the rules given below.

a) $S \to 0A$      d) $B \to 0A$
b) $S \to 0$       e) $B \to 0$
c) $A \to 1B$

It is straightforward to see that the rules a) $S \to 0$, b) $S \to 01B$, c) $B \to 0$, and d) $B \to 01B$ generate the same strings as the rules given above. Thus, the language $L(G_4)$ contains the strings $0, 010, 01010, 0101010, \ldots$, that is, strings of the form $(01)^k 0$ for $k \ge 0$. Consequently $L(G_4) = (01)^* 0$. A formal proof of this result is left to the reader. (See Problem 4.44.)

4.10  Regular  Language  Recognition 

As explained in Section 4.1, a deterministic finite-state machine (DFSM) $M$ is a five-tuple $M = (\Sigma, Q, \delta, s, F)$, where $\Sigma$ is the input alphabet, $Q$ is the set of states, $\delta : Q \times \Sigma \mapsto Q$ is the next-state function, $s$ is the initial state, and $F$ is the set of final states. A nondeterministic FSM (NFSM) is similarly defined except that $\delta$ is a next-set function $\delta : Q \times \Sigma \mapsto 2^Q$. In other words, in an NFSM there may be more than one next state for a given state and input. In Section 4.2 we showed that the languages recognized by these two machine types are the same.

We now show that the languages $L(G)$ and $L(G) \cup \{\epsilon\}$ defined by regular grammars $G$ are exactly those recognized by FSMs.

THEOREM 4.10.1 The languages $L(G)$ and $L(G) \cup \{\epsilon\}$ generated by regular grammars $G$ and recognized by finite-state machines are the same.

Proof  Given  a  regular  grammar  G,  we  construct  a  corresponding  NFSM  M  that  accepts 
exactly  the  strings  generated  by  G.  Similarly,  given  a  DFSM  M  we  construct  a  regular 
grammar  G  that  generates  the  strings  recognized  by  M. 

From a regular grammar $G = (\mathcal{N}, \mathcal{T}, \mathcal{R}, S)$ with rules $\mathcal{R}$ of the form $A \to a$ and $A \to bC$ we create a grammar $G'$ generating the same language by replacing a rule $A \to a$ with rules $A \to aB$ and $B \to \epsilon$ where $B$ is a new non-terminal unique to $A \to a$. Thus, every derivation $S \stackrel{*}{\Rightarrow}_G w$, $w \in \mathcal{T}^*$, now corresponds to a derivation $S \stackrel{*}{\Rightarrow}_{G'} wB$ where $B \to \epsilon$. Hence, the strings generated by $G$ and $G'$ are the same.

Now construct an NFSM $M_{G'}$ whose states correspond to the non-terminals of this new regular grammar and whose input alphabet is its set of terminals. Let the start state of $M_{G'}$ be labeled $S$. Let there be a transition from state $A$ to state $B$ on input $a$ if there is a rule $A \to aB$ in $G'$. Let a state $B$ be a final state if there is a rule of the form $B \to \epsilon$ in $G'$. Clearly, every derivation of a string $w$ in $L(G')$ corresponds to a path in $M_{G'}$ that begins in the start state and ends on a final state. Hence, $w$ is accepted by $M_{G'}$. On the other hand, if a string $w$ is accepted by $M_{G'}$, given the one-to-one correspondence between edges and rules, there is a derivation of $w$ from $S$ in $G'$. Thus, the strings generated by $G$ and the strings accepted by $M_{G'}$ are the same.

Now assume we are given a DFSM $M$ that accepts a language $L_M$. Create a grammar $G_M$ whose non-terminals are the states of $M$ and whose start symbol is the start state of $M$. $G_M$ has a rule of the form $q_1 \to a q_2$ if $M$ makes a transition from state $q_1$ to $q_2$ on input $a$. If state $q$ is a final state of $M$, add the rule $q \to \epsilon$. If a string is accepted by $M$, that is, it causes $M$ to move to a final state, then $G_M$ generates the same string. Since $G_M$ generates only strings of this kind, the language accepted by $M$ is $L(G_M)$. Now convert $G_M$ to a regular grammar $G'_M$ by replacing each pair of rules $q_1 \to a q_2$, $q_2 \to \epsilon$ by the pair $q_1 \to a q_2$, $q_1 \to a$, deleting all rules $q \to \epsilon$ corresponding to unreachable final states $q$, and deleting the rule $S \to \epsilon$ if $\epsilon \in L_M$. Then, $L_M - \{\epsilon\} = L(G_M) - \{\epsilon\} = L(G'_M)$. ■


Figure 4.27 A nondeterministic FSM that accepts a language generated by a regular grammar in which all rules are of the form $A \to bC$ or $A \to \epsilon$. A state is associated with each non-terminal, the start symbol $S$ is associated with the start state, and final states are associated with non-terminals $A$ such that $A \to \epsilon$. This particular NFSM accepts the language $L(G_4)$ of Example 4.9.4.




A simple example illustrates the construction of an NFSM from a regular grammar. Consider
the grammar G_4 of Example 4.9.4. A new grammar G'_4 is constructed with the following
rules: a) S → 0A, b) S → 0C, c) C → ε, d) A → 1B, e) B → 0A, f) B → 0D, and g) D → ε.

Figure 4.27 shows an NFSM that accepts the language generated by this grammar.
A DFSM recognizing the same language can be obtained by invoking the construction of
Theorem 4.2.1.
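
The construction can be tried out on a small scale. The following Python sketch is an illustration only: it assumes that rules (a) through (g) are written over the digits 0 and 1, that every non-terminal is a single character, and the function names are ours. It builds the transition relation of the NFSM from the rules and then tracks the set of reachable states.

```python
def nfsm_from_regular_grammar(rules):
    """Build an NFSM from rules of the form A -> aB or A -> epsilon.

    Each rule is a (head, body) pair; body is "" for epsilon, otherwise a
    one-character terminal followed by a one-character non-terminal.
    """
    delta = {}       # (state, input symbol) -> set of successor states
    finals = set()   # states B for which there is a rule B -> epsilon
    for head, body in rules:
        if body == "":
            finals.add(head)
        else:
            symbol, target = body[0], body[1]
            delta.setdefault((head, symbol), set()).add(target)
    return delta, finals


def accepts(delta, finals, start, w):
    """Follow all paths at once by tracking the set of current states."""
    current = {start}
    for symbol in w:
        current = {t for q in current for t in delta.get((q, symbol), ())}
    return bool(current & finals)


# Rules (a)-(g) of the example, in the same order.
RULES = [("S", "0A"), ("S", "0C"), ("C", ""), ("A", "1B"),
         ("B", "0A"), ("B", "0D"), ("D", "")]
DELTA, FINALS = nfsm_from_regular_grammar(RULES)
```

Under these assumptions the machine accepts 0, 010, 01010, and so on, and rejects strings such as 01.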

4.11  Parsing  Context-Free  Languages 

Parsing  is  the  process  of  deducing  those  rules  of  a  grammar  G  (a  derivation)  that  generates  a 
terminal  string  w.  The  first  rule  must  have  the  start  symbol  S  on  the  left-hand  side.  In  this 
section  we  give  a  brief  introduction  to  the  parsing  of  context-free  languages,  a  topic  central 
to  the  parsing  of  programming  languages.  The  reader  is  referred  to  a  textbook  on  compilers 
for  more  detail  on  this  subject.  (See,  for  example,  [11]  and  [99].)  The  concepts  of  Boolean 
matrix  multiplication  and  transitive  closure  are  used  in  this  section,  topics  that  are  covered  in 
Chapter  6. 

Generally  a  string  w  has  many  derivations.  This  is  illustrated  by  the  context-free  grammar 
G3  defined  in  Example  4.9.3  and  described  below. 

EXAMPLE 4.11.1 G_3 = (N_3, T_3, R_3, S), where N_3 = {S, M, N}, T_3 = {a, b, c} and R_3
consists of the rules below:

a) S → cMNc      d) N → bNb
b) M → aMa       e) N → c
c) M → c

The string caacaabcbc can be derived by applying rules (a), (b) twice, (c), (d) and (e) to
produce the following derivation:

$$
\begin{aligned}
S &\Rightarrow cMNc \Rightarrow caMaNc \Rightarrow ca^2Ma^2Nc \\
  &\Rightarrow ca^2ca^2Nc \Rightarrow ca^2ca^2bNbc \Rightarrow ca^2ca^2bcbc
\end{aligned}
\tag{4.2}
$$

The same string can be obtained by applying the rules in the following order: (a), (d), (e),
(b) twice, and (c). Both derivations are described by the parse tree of Fig. 4.28. In this tree
each instance of a non-terminal is rewritten using one of the rules of the grammar. The order
of the descendants of a non-terminal vertex in the parse tree is the order of the corresponding
symbols in the string obtained by replacing this non-terminal. The string ca²ca²bcbc, the
yield of this parse tree, is the terminal string obtained by visiting the leaves of this tree in a
left-to-right order. The height of the parse tree is the number of edges on the longest path
(having the most edges) from the root (associated with the start symbol) to a terminal symbol.
A parser for a language L(G) is a program or machine that examines a string and produces a
derivation of the string if it is in the language and an error message if not.

Because every string generated by a context-free grammar has a derivation, it has a
corresponding parse tree. Given a derivation, it is straightforward to convert it to a leftmost
derivation, a derivation in which the leftmost remaining non-terminal is expanded first. (A
rightmost derivation is a derivation in which the rightmost remaining non-terminal is
expanded first.) Such a derivation can be obtained from the parse tree by deleting all vertices
associated with terminals and then traversing the remaining vertices in a depth-first manner
(visit the first descendant of a vertex before visiting its siblings), assuming that descendants of
a vertex are ordered from left to right. When a vertex is visited, apply the rule associated with
that vertex in the tree. The derivation given in (4.2) is leftmost.

Figure 4.28 A parse tree for the grammar G_3.
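
The passage from a parse tree to a leftmost derivation can be sketched in a few lines of Python. The tree encoding (a pair of a non-terminal and a list of children, with terminals as plain strings) and the function names are our own assumptions for illustration; the tree encoded below is the one for caacaabcbc in G_3.

```python
def sentential_form(form):
    """Render a list of subtrees and terminals as a string of symbols."""
    return "".join(t if isinstance(t, str) else t[0] for t in form)


def leftmost_derivation(tree):
    """Expand the leftmost remaining non-terminal until none is left."""
    form = [tree]
    steps = [sentential_form(form)]
    while True:
        # Index of the leftmost subtree that is still a non-terminal.
        idx = next((i for i, t in enumerate(form) if not isinstance(t, str)), None)
        if idx is None:
            return steps
        form[idx:idx + 1] = form[idx][1]   # apply the rule stored at this vertex
        steps.append(sentential_form(form))


# Parse tree of caacaabcbc: S -> cMNc, M -> aMa (twice), M -> c, N -> bNb, N -> c.
M_TREE = ("M", ["a", ("M", ["a", ("M", ["c"]), "a"]), "a"])
N_TREE = ("N", ["b", ("N", ["c"]), "b"])
TREE = ("S", ["c", M_TREE, N_TREE, "c"])
STEPS = leftmost_derivation(TREE)
```

The list STEPS then holds the seven sentential forms of derivation (4.2), from S to caacaabcbc.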

Not  only  can  some  strings  in  a  context-free  language  have  multiple  derivations,  but  in 
some  languages  they  have  multiple  parse  trees.  Languages  containing  strings  with  more  than 
one  parse  tree  are  said  to  be  ambiguous  languages.  Otherwise  languages  are  non-ambiguous. 

Given  a  string  that  is  believed  to  be  generated  by  a  grammar,  a  compiler  attempts  to  parse 
the  string  after  first  scanning  the  input  to  identify  letters.  If  the  attempt  fails,  an  error  message 
is  produced.  Given  a  string  generated  by  a  context-free  grammar,  can  we  guarantee  that  we  can 
always  find  a  derivation  or  parse  tree  for  that  string  or  determine  that  none  exists?  The  answer 
is  yes,  as  we  now  show. 

To  demonstrate  that  every  CFL  can  be  parsed,  it  is  convenient  first  to  convert  the  grammar 
for  such  a  language  to  Chomsky  normal  form. 

DEFINITION 4.11.1 A context-free grammar G is in Chomsky normal form if every rule is of
the form A → BC or A → u, u ∈ T, except if ε ∈ L(G), in which case S → ε is also in the
grammar.

We  now  give  a  procedure  to  convert  an  arbitrary  context-free  grammar  to  Chomsky  normal 
form. 

THEOREM 4.11.1 Every context-free language can be generated by a grammar in Chomsky normal
form.

Proof Let L = L(G) where G is a context-free grammar. We construct a context-free
grammar G' that is in Chomsky normal form. The process described in this proof is illustrated
by the example that follows.

Initially G' is identical with G. We begin by eliminating all ε-rules of the form B → ε
except for S → ε if ε ∈ L(G). If either B → ε or B ⇒* ε, for every rule that has B on the
right-hand side, such as A → αBβBγ, α, β, γ ∈ (V − {B})* (V = N ∪ T), we add a rule
for each possible replacement of B by ε; for example, we add A → αβBγ, A → αBβγ,
and A → αβγ. Clearly the strings generated by the new rules are the same as are generated
by the old rules.
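
This step can be illustrated with a short Python sketch. It is a sketch under our own assumptions: rules are (head, body) pairs with the body a tuple of symbols, the empty tuple stands for ε, and the function names are hypothetical. It first finds every non-terminal B with B ⇒* ε and then adds one rule for each way of dropping such non-terminals from a right-hand side.

```python
from itertools import product


def nullable_symbols(rules):
    """Return the non-terminals B with B =>* epsilon."""
    nullable = set()
    changed = True
    while changed:
        changed = False
        for head, body in rules:
            if head not in nullable and all(s in nullable for s in body):
                nullable.add(head)
                changed = True
    return nullable


def eliminate_epsilon_rules(rules, start):
    """Add a rule for every replacement of a nullable symbol by epsilon."""
    nullable = nullable_symbols(rules)
    new_rules = set()
    for head, body in rules:
        # A nullable symbol may be kept or dropped; any other symbol is kept.
        options = [((s,), ()) if s in nullable else ((s,),) for s in body]
        for choice in product(*options):
            new_body = tuple(x for part in choice for x in part)
            if new_body:                      # epsilon-rules are not re-created
                new_rules.add((head, new_body))
    if start in nullable:                     # keep S -> epsilon if epsilon is in L(G)
        new_rules.add((start, ()))
    return new_rules


# A -> xByBz with B -> epsilon and B -> b, mirroring the rule A -> αBβBγ in the text.
RULES = [("A", ("x", "B", "y", "B", "z")), ("B", ()), ("B", ("b",))]
RESULT = eliminate_epsilon_rules(RULES, "A")
```

For this input RESULT contains the four rules for A (with both, one, or neither occurrence of B removed) together with B → b.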

Let A → w_1 ··· w_i ··· w_k for some k ≥ 1 be a rule in G' where w_i ∈ V. We replace
this rule with the new rules A → Z_1 Z_2 ··· Z_k, and Z_i → w_i for 1 ≤ i ≤ k. Here Z_i is a
new non-terminal. Clearly, the new version of G' generates the same language as does G.

With these changes the rules of G' consist of rules either of the form A → u, u ∈ T
(a single terminal) or A → w, w ∈ N⁺ (a string of at least one non-terminal). There are
two cases of w ∈ N⁺ to consider, a) |w| = 1 and b) |w| ≥ 2. We begin by eliminating all
rules of the first kind, that is of the form A → B.

Rules of the form A → B can be cascaded to form rules of the type C ⇒* D. The number
of distinct derivations of this kind is at most |N|! because if any derivation contains two
instances of a non-terminal, the derivation can be shortened. Thus, we need only consider
derivations in which each non-terminal occurs at most once. For each such pair C, D with
a relation of this kind, add the rule C → D to G'. If C → D and D → w for |w| ≥ 2 or
w = u ∈ T, add C → w to the set of rules. After adding all such rules, delete all rules of
the form A → B. By construction this new set of rules generates the same language as the
original set of rules but eliminates all rules of the first kind.
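
A minimal Python sketch of this step follows. It assumes, for illustration only, that rules are (head, body) pairs with tuple bodies and that the relation C ⇒* D is computed by a simple fixed-point iteration rather than by enumerating derivations; the names are ours. The test grammar is the expression grammar with rules E → E+T, E → T, T → T*F, T → F, F → (E), F → a, F → b.

```python
def eliminate_unit_rules(rules, nonterminals):
    """Replace chains C =>* D of single non-terminals by direct rules C -> w."""
    def is_unit(body):
        return len(body) == 1 and body[0] in nonterminals

    # reach[C] is the set of non-terminals D with C =>* D using unit rules only.
    reach = {c: {c} for c in nonterminals}
    changed = True
    while changed:
        changed = False
        for head, body in rules:
            if is_unit(body):
                for c in nonterminals:
                    if head in reach[c] and body[0] not in reach[c]:
                        reach[c].add(body[0])
                        changed = True

    # For each C and each D reachable from C, copy every non-unit rule D -> w.
    return {(c, body)
            for c in nonterminals
            for head, body in rules
            if head in reach[c] and not is_unit(body)}


EXPR_RULES = [
    ("E", ("E", "+", "T")), ("E", ("T",)),
    ("T", ("T", "*", "F")), ("T", ("F",)),
    ("F", ("(", "E", ")")), ("F", ("a",)), ("F", ("b",)),
]
RESULT = eliminate_unit_rules(EXPR_RULES, {"E", "T", "F"})
```

The result has twelve rules: five for E, four for T, and three for F, and none of them has a single non-terminal as its right-hand side.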

We now replace rules of the type A → A_1 A_2 ··· A_k, k ≥ 3. Introduce k − 2 new
non-terminals N_1, N_2, ..., N_{k−2} peculiar to this rule and replace the rule with the following
rules: A → A_1 N_1, N_1 → A_2 N_2, ..., N_{k−3} → A_{k−2} N_{k−2}, N_{k−2} → A_{k−1} A_k. Clearly, the
new grammar generates the same language as the original grammar and is in the Chomsky
normal form. ■
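
The last replacement is mechanical, as the following Python sketch suggests. The helper names make_fresh and binarize, and the naming of new non-terminals as N1, N2, ..., are assumptions made for this illustration.

```python
import itertools


def make_fresh(prefix="N"):
    """Return a function that produces new non-terminal names N1, N2, ..."""
    counter = itertools.count(1)
    return lambda: f"{prefix}{next(counter)}"


def binarize(head, body, fresh):
    """Split head -> A1 A2 ... Ak (k >= 3) into k - 1 rules with two symbols each."""
    k = len(body)
    if k <= 2:
        return [(head, tuple(body))]
    new = [fresh() for _ in range(k - 2)]          # N1, ..., N(k-2)
    rules = [(head, (body[0], new[0]))]            # A -> A1 N1
    for i in range(1, k - 2):
        rules.append((new[i - 1], (body[i], new[i])))        # Ni -> A(i+1) N(i+1)
    rules.append((new[-1], (body[k - 2], body[k - 1])))      # N(k-2) -> A(k-1) Ak
    return rules


EXAMPLE = binarize("A", ["B", "C", "D", "E"], make_fresh())
```

For k = 4 the call above yields A → B N1, N1 → C N2, and N2 → D E.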

EXAMPLE 4.11.2 Let G_5 = (N_5, T_5, R_5, E) (with start symbol E) be the grammar with N_5 =
{E, T, F}, T_5 = {a, b, +, *, (, )}, and R_5 consisting of the rules given below:

a) E → E + T      d) T → F        f) F → a
b) E → T          e) F → (E)      g) F → b
c) T → T * F

Here E, T, and F denote expressions, terms, and factors. It is straightforward to show that
E ⇒* (a * b + a) * (a + b) and E ⇒* a * b + a are two possible derivations.

We convert this grammar to the Chomsky normal form using the method described in the
proof of Theorem 4.11.1. Since R_5 contains no ε-rules, we do not need the rule E → ε, nor
do we need to eliminate ε-rules.

First  we  convert  rules  of  the  form  A  — >  w  so  that  each  entry  in  w  is  a  non-terminal.  To 
do  this  we  introduce  the  non-terminals  (,  ),  +,  and  *  and  the  rules  below.  Here  we  use  a 
boldface  font  to  distinguish  between  the  non-terminal  and  terminal  equivalents  of  these  four 
mathematical  symbols.  Since  we  are  adding  to  the  original  set  of  rules,  we  number  them 
consecutively  with  the  original  rules. 

$$
\begin{array}{ll}
h)\ \mathbf{(} \to ( & j)\ \mathbf{+} \to + \\
i)\ \mathbf{)} \to ) & k)\ \mathbf{*} \to *
\end{array}
$$

Next we add rules of the form C → D for all chains of single non-terminals such that
C ⇒* D. Since by inspection E ⇒* F, we add the rule E → F. For every rule of the form A → B
for which B → w, we add the rule A → w. We then delete all rules of the form A → B. These
changes cause the rules of G' to become the following. (Below we use a different numbering
scheme because all these rules replace rules (a) through (k).)


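$$
\begin{array}{lll}
1)\ E \to E\,\mathbf{+}\,T & 7)\ T \to \mathbf{(}E\mathbf{)} & 13)\ \mathbf{(} \to ( \\
2)\ E \to T\,\mathbf{*}\,F & 8)\ T \to a & 14)\ \mathbf{)} \to ) \\
3)\ E \to \mathbf{(}E\mathbf{)} & 9)\ T \to b & 15)\ \mathbf{+} \to + \\
4)\ E \to a & 10)\ F \to \mathbf{(}E\mathbf{)} & 16)\ \mathbf{*} \to * \\
5)\ E \to b & 11)\ F \to a & \\
6)\ T \to T\,\mathbf{*}\,F & 12)\ F \to b &
\end{array}
$$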

We  now  reduce  the  number  of  non-terminals  on  the  right-hand  side  of  each  rule  to  two 
through  the  addition  of  new  non-terminals.  The  result  is  shown  in  Example  4.11.3  below, 
where  we  have  added  the  non-terminals  A,  B,  C,  D,  G,  and  H. 

EXAMPLE 4.11.3 Let G_6 = (N_6, T_6, R_6, E) (with start symbol E) be the grammar with N_6 =
{A, B, C, D, E, F, G, H, T, +, *, (, )}, T_6 = {a, b, +, *, (, )}, and R_6 consisting of the rules given
below.


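$$
\begin{array}{lll}
(A)\ E \to EA & (I)\ T \to TD & (Q)\ H \to E\,\mathbf{)} \\
(B)\ A \to \mathbf{+}\,T & (J)\ D \to \mathbf{*}\,F & (R)\ F \to a \\
(C)\ E \to TB & (K)\ T \to \mathbf{(}\,G & (S)\ F \to b \\
(D)\ B \to \mathbf{*}\,F & (L)\ G \to E\,\mathbf{)} & (T)\ \mathbf{(} \to ( \\
(E)\ E \to \mathbf{(}\,C & (M)\ T \to a & (U)\ \mathbf{)} \to ) \\
(F)\ C \to E\,\mathbf{)} & (N)\ T \to b & (V)\ \mathbf{+} \to + \\
(G)\ E \to a & (P)\ F \to \mathbf{(}\,H & (W)\ \mathbf{*} \to * \\
(H)\ E \to b & &
\end{array}
$$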

The  new  grammar  clearly  generates  the  same  language  as  does  the  original  grammar,  but  it 
is  in  Chomsky  normal  form.  It  has  22  rules,  13  non-terminals,  and  six  terminals  whereas  the 
original  grammar  had  seven  rules,  three  non-terminals,  and  six  terminals. 

We now use the Chomsky normal form to show that for every CFL there is a polynomial-time
algorithm that tests for membership of a string in the language. This algorithm can be
practical for some languages.

THEOREM 4.11.2 Given a context-free grammar G = (N, T, R, S), an O(n³|N|²)-step
algorithm exists to determine whether or not a string w ∈ T* of length n is in L(G) and to construct
a parse tree for it if it exists.

Proof If G is not in Chomsky normal form, convert it to this form. Given a string w =
(w_1, w_2, ..., w_n), the goal is to determine whether or not S ⇒* w. Let ∅ denote the empty
set. The approach taken is to construct an (n + 1) × (n + 1) set matrix S whose entries
are sets of non-terminals of G with the property that the i, j entry, a_{i,j}, is the set of
non-terminals C such that C ⇒* w_i ··· w_{j−1}. Thus, the string w is in L(G) if S ∈ a_{1,n+1}, since
S generates the entire string w. Clearly, a_{i,j} = ∅ for j ≤ i. We illustrate this construction
with the example following this proof.

We show by induction that set matrix S is the transitive closure (denoted B⁺) of the
(n + 1) × (n + 1) set matrix B whose i, j entry b_{i,j} = ∅ for j ≠ i + 1 when 1 ≤ i ≤ n
and b_{i,i+1} is defined as follows:

$$
b_{i,i+1} = \{A \mid (A \to w_i) \text{ in } R \text{ where } w_i \in T\}
$$

$$
B = \begin{bmatrix}
\emptyset & b_{1,2} & \emptyset & \cdots & \emptyset \\
\emptyset & \emptyset & b_{2,3} & \cdots & \emptyset \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\emptyset & \emptyset & \emptyset & \cdots & b_{n,n+1} \\
\emptyset & \emptyset & \emptyset & \cdots & \emptyset
\end{bmatrix}
$$

Thus, the entry b_{i,i+1} is the set of non-terminals that generate the ith terminal symbol w_i
of w in one step. The value of each entry in the matrix B is the empty set except for the
entries b_{i,i+1} for 1 ≤ i ≤ n, n = |w|.

We extend the concept of matrix multiplication (see Chapter 6) to the product of two
set matrices. Doing this requires a new definition for the product of two sets (entries in the
matrix) as well as for the addition of two sets. The product S_1 · S_2 of sets of non-terminals
S_1 and S_2 is defined as:

$$
S_1 \cdot S_2 = \{A \mid \text{there exist } B \in S_1 \text{ and } C \in S_2 \text{ such that } (A \to BC) \in R\}
$$

Thus, S_1 · S_2 is the set of non-terminals for which there is a rule in R of the form A → BC
where B ∈ S_1 and C ∈ S_2. The sum of two sets is their union.

The i, j entry of the product C = D × E of two m × m matrices D and E, each
containing sets of non-terminals, is defined below in terms of the product and union of sets:

$$
c_{i,j} = \bigcup_{k=1}^{m} d_{i,k} \cdot e_{k,j}
$$

We also define the transitive closure C⁺ of an m × m matrix C as follows:

$$
C^{+} = C^{(1)} \cup C^{(2)} \cup C^{(3)} \cup \cdots \cup C^{(m)}
$$

where

$$
C^{(s)} = \bigcup_{r=1}^{s-1} C^{(r)} \times C^{(s-r)} \quad \text{and} \quad C^{(1)} = C
$$

By the definition of the matrix product, the entry b^(2)_{i,j} of the matrix B^(2) is ∅ if j ≠ i + 2
and otherwise is the set of non-terminals A that produce w_i w_{i+1} through a derivation tree
of depth 2; that is, there are rules such that A → BC, B → w_i, and C → w_{i+1}, which
implies that A ⇒* w_i w_{i+1}.

Similarly, it follows that both B^(1)B^(2) and B^(2)B^(1) are ∅ in all positions except i, i + 3
for 1 ≤ i ≤ n − 2. The entry in position i, i + 3 of B^(3) = B^(1)B^(2) ∪ B^(2)B^(1)
contains the set of non-terminals A that produce w_i w_{i+1} w_{i+2} through a derivation tree of
depth 3; that is, A → BC and either B produces w_i w_{i+1} through a derivation of depth 2
(B ⇒* w_i w_{i+1}) and C produces w_{i+2} in one step (C → w_{i+2}) or B produces w_i in one step
(B → w_i) and C produces w_{i+1} w_{i+2} through a derivation of depth 2 (C ⇒* w_{i+1} w_{i+2}).

Finally, the only entry in B^(n) that is not ∅ is the 1, n + 1 entry and it contains the set
of non-terminals, if any, that generate w. If S is in this set, w is in L(G).

The transitive closure S = B⁺ involves ∑_{r=1}^{n} r = (n + 1)n/2 products of set matrices.
The product of two (n + 1) × (n + 1) set matrices of the type considered here involves at
most n products of sets. Thus, at most O(n³) products of sets are needed to form S. In turn,
a product of two sets, S_1 · S_2, can be formed with O(q²) operations, where q = |N| is the
number of non-terminals. It suffices to compare each pair of entries, one from S_1 and the
other from S_2, through a table to determine if they form the right-hand side of a rule.

As the matrices are being constructed, if a pair of non-terminals is discovered that is the
right-hand side of a rule, that is, A → BC, then a link can be made from the entry A in the
product matrix to the entries B and C. From the entry S in a_{1,n+1}, if it exists, links can be
followed to generate a parse tree for the input string. ■
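
The procedure of the proof can be sketched in Python as follows. This is an illustration under our own assumptions, not a definitive implementation: the entries a_{i,j} are kept in a dictionary indexed by (i, j), the grammar is that of Example 4.11.3 with the names LP, RP, PLUS, and STAR standing in for the boldface non-terminals (, ), +, and *, and the function names cyk, build_tree, and tree_yield are ours. The dictionary back records the links described in the last paragraph of the proof.

```python
UNIT_RULES = {  # terminal -> non-terminals that produce it in one step
    "a": {"E", "T", "F"}, "b": {"E", "T", "F"},
    "(": {"LP"}, ")": {"RP"}, "+": {"PLUS"}, "*": {"STAR"},
}
PAIR_RULES = {  # (B, C) -> non-terminals A for which A -> BC is a rule
    ("E", "A"): {"E"}, ("PLUS", "T"): {"A"},
    ("T", "B"): {"E"}, ("STAR", "F"): {"B", "D"},
    ("LP", "C"): {"E"}, ("E", "RP"): {"C", "G", "H"},
    ("T", "D"): {"T"}, ("LP", "G"): {"T"}, ("LP", "H"): {"F"},
}


def cyk(word, start="E"):
    """Fill a[i, j] with the non-terminals that generate word[i-1 : j-1]."""
    n = len(word)
    a, back = {}, {}
    for i in range(1, n + 1):                      # the entries of B = B^(1)
        a[i, i + 1] = set(UNIT_RULES.get(word[i - 1], ()))
    for s in range(2, n + 1):                      # the entries of B^(s)
        for i in range(1, n - s + 2):
            j = i + s
            a[i, j] = set()
            for k in range(i + 1, j):              # union over split points
                for left in a[i, k]:
                    for right in a[k, j]:
                        for head in PAIR_RULES.get((left, right), ()):
                            if head not in a[i, j]:
                                a[i, j].add(head)
                                back[i, j, head] = (k, left, right)
    return start in a.get((1, n + 1), set()), a, back


def build_tree(word, back, i, j, head):
    """Follow the recorded links to recover a parse tree."""
    if j == i + 1:
        return (head, word[i - 1])
    k, left, right = back[i, j, head]
    return (head, build_tree(word, back, i, k, left),
            build_tree(word, back, k, j, right))


def tree_yield(tree):
    """Concatenate the leaves of a parse tree from left to right."""
    head, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return children[0]
    return "".join(tree_yield(c) for c in children)


ACCEPTED, TABLE, LINKS = cyk("a*b+a")
```

For the string a*b+a the entry TABLE[1, 6] contains E, so the string is accepted, and build_tree("a*b+a", LINKS, 1, 6, "E") returns a parse tree whose yield is the input string.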


The procedure described in this proof can be extended to show that membership in an
arbitrary CFL can be determined in time O(M(n)), where M(n) is the number of operations
to multiply two n × n matrices [342]. This is the fastest known general algorithm for this
problem when the grammar is part of the input. For some CFLs, faster algorithms are known
that are based on the use of the deterministic pushdown automaton. For fixed grammars
membership algorithms often run in O(n) steps. The reader is referred to books on compilers
for such results. The procedure of the proof is illustrated by the following example.

EXAMPLE 4.11.4 Consider the grammar G_6 of Example 4.11.3. We show how the five-character
string a*b+a in L(G_6) can be parsed. We construct the 6 × 6 matrices B^(1), B^(2), B^(3), B^(4),
B^(5). Since B^(5) contains E in the 1, n + 1 position, a*b+a is in the language.
Furthermore, we can follow links between non-terminals to demonstrate that this string
has the parse tree shown in Fig. 4.29. Each entry of the matrix B^(4) is ∅.

Figure 4.29 The parse tree for the string a * b + a in the language L(G_6).

4.12 CFL Acceptance with Pushdown Automata*

While  it  is  now  clear  that  an  algorithm  exists  to  parse  every  context-free  language,  it  is  useful 
to  show  that  there  is  a  class  of  automata  that  accepts  exactly  the  context-free  languages.  These 
are  the  nondeterministic  pushdown  automata  (PDA)  described  in  Section  4.8. 

We now establish the principal results of this section, namely, that the context-free
languages are accepted by PDAs and that the languages accepted by PDAs are context-free. We
begin with the first result.

THEOREM 4.12.1 For each context-free grammar G there is a PDA M that accepts L(G). That
is, L(M) = L(G).

Proof Before beginning this proof, we extend the definition of a PDA to allow it to push
strings onto the stack instead of just symbols. That is, we extend the stack alphabet Γ to
include a small set of strings. When a string such as abcd is pushed, a is pushed before b, b
before c, etc. This does not increase the power of the PDA, because for each string we can
add unique states that M enters after pushing each symbol except the last. With the pushing
of the last symbol M enters the successor state specified in the transition being executed.

Let G = (N, T, R, S) be a context-free grammar. We construct a PDA M = (Σ, Γ, Q,
Δ, s, F), where Σ = T, Γ = N ∪ T ∪ {γ} (γ is the blank stack symbol), Q = {s, p, f},
F = {f}, and Δ consists of transitions of the types shown below. Here ∀ denotes “for all”
and ∀(A → v) ∈ R means for all rules in R.

a) (s, ε, ε; p, S)
b) (p, a, a; p, ε)    ∀a ∈ T
c) (p, ε, A; p, v)    ∀(A → v) ∈ R
d) (p, ε, γ; f, ε)

Let w be placed left-adjusted on the input tape of M. Since w is generated by G, it has
a leftmost derivation. (Consider for example that given in (4.2).) The PDA
begins by pushing the start symbol S onto the stack and entering state p (Rule (a)). From
this point on the PDA simulates a leftmost derivation of the string w placed initially on its
tape. (See the example that follows this proof.) M either matches a terminal of G on the top
of the stack with one under the tape head (Rule (b)) or it replaces a non-terminal on the top
of the stack with a rule of R by pushing the right-hand side of the rule onto the stack (Rule
(c)). Finally, when the stack is empty, M can choose to enter the final state f and accept w.
It follows that any string that can be generated by G can also be accepted by M and vice
versa. ■
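
The behavior of M in state p can be explored with a small Python simulation. The sketch below makes several simplifying assumptions of our own: it tracks only the input position and the stack contents (with the top of the stack at index 0), it omits the states s and f and the blank stack symbol, and it searches all nondeterministic choices depth-first. The pruning test guarantees termination only for grammars such as G_3 in which every rule contains at least one terminal.

```python
def pda_accepts(rules, start, w):
    """Search the configurations of the PDA built from a grammar.

    rules is a list of (non-terminal, right-hand side) pairs with
    single-character symbols; a configuration is (input position, stack).
    """
    nonterminals = {head for head, _ in rules}
    seen = set()
    frontier = [(0, (start,))]            # Rule (a): push the start symbol
    while frontier:
        pos, stack = frontier.pop()
        if (pos, stack) in seen:
            continue
        seen.add((pos, stack))
        if not stack:                     # empty stack: accept if input is consumed
            if pos == len(w):
                return True
            continue
        top, rest = stack[0], stack[1:]
        if top in nonterminals:           # Rule (c): replace A by a right-hand side
            for head, body in rules:
                if head == top:
                    new_stack = tuple(body) + rest
                    terminals = sum(s not in nonterminals for s in new_stack)
                    if terminals <= len(w) - pos:     # prune hopeless branches
                        frontier.append((pos, new_stack))
        elif pos < len(w) and w[pos] == top:          # Rule (b): match a terminal
            frontier.append((pos + 1, rest))
    return False


G3_RULES = [("S", "cMNc"), ("M", "aMa"), ("M", "c"), ("N", "bNb"), ("N", "c")]
```

With these rules pda_accepts(G3_RULES, "S", "caacaabcbc") succeeds by following the leftmost derivation (4.2), while a string with unbalanced a's such as caacabcbc is rejected.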

The leftmost derivation of the string caacaabcbc by the grammar G_3 of Example 4.11.1
is shown in (4.2). The PDA M of the above proof can simulate this derivation, as we show.
With the notation T : ... and S : ... (shown below before the computation begins) we
denote the contents of the tape and stack at a point in time at which the underlined symbols
are those under the tape head and at the top of the stack, respectively. We ignore the blank
tape and stack symbols unless they are the ones underlined.

$$T : \underline{c}aacaabcbc \qquad S : \underline{\gamma}$$

After the first step taken by M, the tape and stack configurations are:

$$T : \underline{c}aacaabcbc \qquad S : \underline{S}$$

From this point on M simulates a derivation by G_3. Consulting (4.2), we see that the rule
S → cMNc is the first to be applied. M simulates this with the transition (p, ε, S; p, cMNc),
which causes S to be popped from the stack and cMNc to be pushed onto it without advancing
the tape head. The resulting configurations are shown below:

$$T : \underline{c}aacaabcbc \qquad S : \underline{c}MNc$$

Next the transition (p, c, c; p, ε) is applied to pop one item from the stack, exposing the
non-terminal M and advancing the tape head to give the following configurations:

$$T : c\underline{a}acaabcbc \qquad S : \underline{M}Nc$$

The subsequent rules, in order, are the following:

1) M → aMa
2) M → aMa
3) M → c
4) N → bNb
5) N → c

The corresponding transitions of the PDA are shown in Fig. 4.30.

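$$
\begin{array}{ll}
T : c\underline{a}acaabcbc & S : \underline{a}MaNc \\
T : ca\underline{a}caabcbc & S : \underline{M}aNc \\
T : ca\underline{a}caabcbc & S : \underline{a}MaaNc \\
T : caa\underline{c}aabcbc & S : \underline{M}aaNc \\
T : caa\underline{c}aabcbc & S : \underline{c}aaNc \\
T : caac\underline{a}abcbc & S : \underline{a}aNc \\
T : caaca\underline{a}bcbc & S : \underline{a}Nc \\
T : caacaa\underline{b}cbc & S : \underline{N}c \\
T : caacaa\underline{b}cbc & S : \underline{b}Nbc \\
T : caacaab\underline{c}bc & S : \underline{N}bc \\
T : caacaab\underline{c}bc & S : \underline{c}bc \\
T : caacaabc\underline{b}c & S : \underline{b}c \\
T : caacaabcb\underline{c} & S : \underline{c} \\
T : caacaabcbc\underline{\beta} & S : \underline{\gamma}
\end{array}
$$

Figure 4.30 PDA transitions corresponding to the leftmost derivation of the string caacaabcbc
in the grammar G_3 of Example 4.11.1.

We now show that the language accepted by a PDA can be generated by a context-free
grammar.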


THEOREM 4.12.2 For each PDA M there is a context-free grammar G that generates the language
L(M) accepted by M. That is, L(G) = L(M).

Proof It is convenient to assume that when the PDA M accepts a string it does so with
an empty stack. If M is not of this type, we can design a PDA M' accepting the same
language that does meet this condition. The states of M' consist of the states of M plus
three additional states, a new initial state s', a cleanup state k, and a new final state f'. Its
tape symbols are identical to those of M. Its stack symbols consist of those of M plus one
new symbol κ. In its initial state M' pushes κ onto the stack without reading a tape symbol
and enters state s, which was the initial state of M. It then operates as M (it has the same
transitions) until entering a final state of M, upon which it enters the cleanup state k. In
this state it pops the stack until it finds the symbol κ, at which time it enters its final state
f'. Clearly, M' accepts the same language as M but leaves its stack empty.

We describe a context-free grammar G = (N, T, R, S) with the property that L(G) =
L(M). The non-terminals of G consist of S and the triples ⟨p, y, q⟩ defined below
denoting goals:

⟨p, y, q⟩ ∈ N where N ⊂ Q × (Γ ∪ {ε}) × Q

The meaning of ⟨p, y, q⟩ is that M moves from state p to state q in a series of steps
during which its only effect on the stack is to pop y. The triple ⟨p, ε, q⟩ denotes the goal
of moving from state p to state q leaving the stack in its original condition. Since M starts
with an empty stack in state s with a string w on its tape and ends in a final state f with
its stack empty, the non-terminal ⟨s, ε, f⟩, f ∈ F, denotes the goal of M moving from
state s to a final state f on input w, and leaving the stack in its original state.

The rules of G, which represent goal refinement, are described by the following
conditions. Each condition specifies a family of rules for a context-free grammar G. Each
rule either replaces one non-terminal with another, replaces a non-terminal with the empty
string, or rewrites a non-terminal with a terminal or empty string followed by one or two
non-terminals. The result of applying a sequence of rules is a string of terminals in the
language L(G). Below we show that L(G) = L(M).

$$
\begin{array}{lll}
1) & S \to \langle s, \epsilon, f \rangle & \forall f \in F \\
2) & \langle p, \epsilon, p \rangle \to \epsilon & \forall p \in Q \\
3) & \langle p, y, r \rangle \to x \langle q, z, r \rangle & \forall r \in Q \text{ and } \forall (p, x, y; q, z) \in \Delta, \text{ where } y \neq \epsilon \\
4) & \langle p, u, r \rangle \to x \langle q, z, t \rangle \langle t, u, r \rangle & \forall r, t \in Q,\ \forall (p, x, \epsilon; q, z) \in \Delta, \text{ and } \forall u \in \Gamma \cup \{\epsilon\}
\end{array}
$$

Condition (1) specifies rules that map the start symbol of G onto the goal non-terminal
symbol ⟨s, ε, f⟩ for each final state f. These rules ensure that the start symbol of G is
rewritten as the goal of moving from the initial state of M to a final state, leaving the stack
in its original condition.

Condition (2) specifies rules that map non-terminals ⟨p, ε, p⟩ onto the empty string.
Thus, all goals of moving from a state to itself leaving the stack in its original condition can
be ignored. In other words, no input is needed to take M from state p back to itself leaving
the stack unchanged.

Condition (3) specifies rules stating that for all r ∈ Q and (p, x, y; q, z), y ≠ ε, that are
transitions of M, a goal ⟨p, y, r⟩ to move from state p to state r while removing y from
the stack can be accomplished by reading tape symbol x, replacing the top stack symbol
y with z, and then realizing the goal ⟨q, z, r⟩ of moving from state q to state r while
removing z from the stack.

Condition (4) specifies rules stating that for all r, t ∈ Q and (p, x, ε; q, z) that are
transitions of M, the goal ⟨p, u, r⟩ of moving from state p to state r while popping u
for arbitrary stack symbol u can be achieved by reading input x and pushing z on top of u
and then realizing the goal ⟨q, z, t⟩ of moving from q to some state t while popping z
followed by the goal ⟨t, u, r⟩ of moving from t to r while popping u.
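
The four conditions translate directly into a procedure that lists the rules of G. The Python sketch below is an illustration under assumptions of our own: a transition (p, x, y; q, z) is a 5-tuple, the empty string "" plays the role of ε, a goal ⟨p, y, q⟩ is a 3-tuple, and the tiny two-state PDA used as input is hypothetical rather than one taken from the text.

```python
EPS = ""   # stands for the empty string (epsilon)


def pda_to_cfg_rules(states, stack_symbols, transitions, start_state, finals):
    """List the rules of G as (head, body) pairs following conditions (1)-(4)."""
    states = sorted(states)
    rules = []
    for f in sorted(finals):                              # condition (1)
        rules.append(("S", [(start_state, EPS, f)]))
    for p in states:                                      # condition (2)
        rules.append(((p, EPS, p), []))
    for (p, x, y, q, z) in transitions:
        if y != EPS:                                      # condition (3)
            for r in states:
                rules.append(((p, y, r), [x, (q, z, r)]))
        else:                                             # condition (4)
            for r in states:
                for t in states:
                    for u in sorted(stack_symbols) + [EPS]:
                        rules.append(((p, u, r), [x, (q, z, t), (t, u, r)]))
    return rules


# A hypothetical PDA: push Z while reading a, then pop Z while reading b.
TRANSITIONS = [("s", "a", EPS, "s", "Z"), ("s", "b", "Z", "f", EPS)]
RULES = pda_to_cfg_rules({"s", "f"}, {"Z"}, TRANSITIONS, "s", {"f"})
```

For this input the procedure produces one rule from condition (1), two from condition (2), two from condition (3), and eight from condition (4). Many of these rules involve goals that can never be realized; they are harmless because no terminal string can be derived from them.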

We now show that any string accepted by M can be generated by G and any string
generated by G can be accepted by M. It follows that L(M) = L(G). Instead of showing
this directly, we establish a more general result.

CLAIM: For all r, t ∈ Q and u ∈ Γ ∪ {ε}, ⟨r, u, t⟩ ⇒*_G w if and only if the PDA M
can move from state r to state t while reading w and popping u from the stack.

The theorem follows from the claim because ⟨s, ε, f⟩ ⇒*_G w if and only if the PDA
M can move from initial state s to a final state f while reading w and leaving the stack
empty, that is, if and only if M accepts w.

We first establish the “if” portion of the claim, namely, if for r, t ∈ Q and u ∈ Γ ∪ {ε}
the PDA M can move from r to t while reading w and popping u from the stack, then
⟨r, u, t⟩ ⇒*_G w. The proof is by induction on the number of steps taken by M. If no
step is taken (basis for induction), r = t, nothing is popped and the string ε is read by M.
Since the grammar G contains the rule ⟨r, ε, r⟩ → ε, the basis is established.

Suppose that the “if” portion of the claim is true for k or fewer steps (inductive
hypothesis). We show that it is true for k + 1 steps (induction step). If the PDA M can move
from r to t in k + 1 steps while reading w = xv and removing u from the stack, then on
its first step it must execute a transition (r, x, y; q, z), q ∈ Q, z ∈ Γ ∪ {ε}, for x ∈ Σ with
either y = u if u ≠ ε or y = ε. In the first case, M enters state q, pops u, and pushes
z. M subsequently pops z as it reads v and moves to state t in k steps. It follows from the
inductive hypothesis that ⟨q, z, t⟩ ⇒*_G v. Since y ≠ ε, a rule of type (3) applies, that is,
⟨r, y, t⟩ → x⟨q, z, t⟩. It follows that ⟨r, y, t⟩ ⇒*_G w, the desired conclusion.

In the second case y = ε and M makes the transition (r, x, ε; q, z) by moving from r to
q and pushing z while reading x. To pop u, which must have been at the top of the stack, M
must first pop z and then pop u. Let it pop z as it moves from q to some intermediate state
t' while reading a first portion v_1 of the input word v. Let it pop u as it moves from t' to t
while reading a second portion v_2 of the input word v. Here v_1 v_2 = v. Since the move from
q to t' and from t' to t each involves at most k steps, it follows that the goals ⟨q, z, t'⟩
and ⟨t', u, t⟩ satisfy ⟨q, z, t'⟩ ⇒*_G v_1 and ⟨t', u, t⟩ ⇒*_G v_2. Because M's first
transition meets condition (4), there is a rule ⟨r, u, t⟩ → x⟨q, z, t'⟩⟨t', u, t⟩.
Combining these derivations yields the desired conclusion.

Now we establish the "only if" part of the claim, namely, if for $r, t \in Q$ and
$u \in \Gamma \cup \{\epsilon\}$, $\langle r,u,t \rangle \Rightarrow^*_G w$, then the PDA $M$ can move
from state $r$ to state $t$ while reading $w$ and removing $u$ from the stack. Again the proof is by
induction, this time on the number of derivation steps. If there is a single derivation step (basis
for induction), it must be of the type stated in condition (2), namely
$\langle p,\epsilon,p \rangle \to \epsilon$. Since $M$ can move from state $p$ to $p$ without reading
the tape or pushing data onto its stack, the basis is established.

Suppose that the "only if" portion of the claim is true for $k$ or fewer derivation steps
(inductive hypothesis). We show that it is true for $k + 1$ steps (induction step). That is,
if $\langle r,u,t \rangle \Rightarrow^*_G w$ in $k + 1$ steps, then we show that $M$ can move from
$r$ to $t$ while reading $w$ and popping $u$ from the stack. We can assume that the first derivation
step is of type (3) or (4) because if it is of type (2), the derivation can be shortened and the
result follows from the inductive hypothesis. If the first derivation is of type (3), namely, of the
form $\langle r,u,t \rangle \to x \langle q,z,t \rangle$, then by the inductive hypothesis, $M$ can
execute $(r, x, u; q, z)$, $u \neq \epsilon$, that is, read $x$, pop $u$, push $z$, and enter state
$q$. Since $\langle r,u,t \rangle \Rightarrow^*_G w$, where $w = xv$, it follows that
$\langle q,z,t \rangle \Rightarrow^*_G v$. Again by the inductive hypothesis $M$ can move
from $q$ to $t$ while reading $v$ and popping $z$. Combining these results, we have the desired
conclusion.

If the first derivation is of type (4), namely,
$\langle r,u,t \rangle \to x \langle q,z,t' \rangle \langle t',u,t \rangle$,
then the two non-terminals $\langle q,z,t' \rangle$ and $\langle t',u,t \rangle$ must expand to
substrings $v_1$ and $v_2$, respectively, of $v$ where $w = x v_1 v_2 = xv$. That is,
$\langle q,z,t' \rangle \Rightarrow^*_G v_1$ and $\langle t',u,t \rangle \Rightarrow^*_G v_2$. By the
inductive hypothesis, $M$ can move from $q$ to $t'$ while reading $v_1$ and popping $z$ and it can
also move from $t'$ to $t$ while reading $v_2$ and popping $u$. Thus, $M$ can move from $r$ to $t$
while reading $w$ and popping $u$, which is the desired conclusion. ■
4.13 Properties of Context-Free Languages

In this section we derive properties of context-free languages. We begin by establishing a
pumping lemma that demonstrates that every CFL has a certain periodicity property. This
property, together with the closure of the class of CFLs under the operations of concatenation,
union, and Kleene closure, is used to show that the class is not closed under complementation
and intersection.

4.13.1  CFL  Pumping  Lemma 

The  pumping  lemma  for  regular  languages  established  in  Section  4.5  showed  that  if  a  regular 
language  contains  an  infinite  number  of  strings,  then  it  must  have  strings  of  a  particular  form. 
This  lemma  was  used  to  show  that  some  languages  are  not  regular.  We  establish  a  similar  result 
for  context-free  languages. 

LEMMA 4.13.1 Let $G = (\mathcal{N}, \mathcal{T}, \mathcal{R}, S)$ be a context-free grammar in
Chomsky normal form with $m$ non-terminals. Then, if $w \in L(G)$ and $|w| \geq 2^{m-1} + 1$, there
are strings $r$, $s$, $t$, $u$, and $v$ with $w = rstuv$ such that $|su| \geq 1$ and
$|stu| \leq 2^m$ and for all integers $n \geq 0$, $S \Rightarrow^*_G r s^n t u^n v \in L(G)$.

Proof Since each production is of the form $A \to BC$ or $A \to a$, a subtree of a parse tree of
height $h$ has a yield (number of leaves) of at most $2^{h-1}$. To see this, observe that each rule
that generates a leaf is of the form $A \to a$. Thus, the yield is the number of leaves in a binary
tree of height $h - 1$, which is at most $2^{h-1}$.

Let $K = 2^{m-1} + 1$. If there is a string $w$ in $L$ of length $K$ or greater, its parse tree
has height greater than $m$. Thus, a longest path $P$ in such a tree (see Fig. 4.31(a)) has more
than $m$ non-terminals on it. Consider the subpath $SP$ of $P$ containing the last $m + 1$
non-terminals of $P$. Let $D$ be the first non-terminal on $SP$ and let the yield of its parse tree
be $y$. It follows that $|y| \leq 2^m$. Thus, the yield of the full parse tree, $w$, can be written
as $w = xyz$ for strings $x$, $y$, and $z$ in $\mathcal{T}^*$.

Figure 4.31 $L(G)$ is generated by a grammar $G$ in Chomsky normal form with $m$ non-terminals.
(a) Each $w \in L(G)$ with $|w| \geq 2^{m-1} + 1$ has a parse tree with a longest path $P$
containing at least $m + 1$ non-terminals. (b) $SP$, the portion of $P$ containing the last $m + 1$
non-terminals on $P$, has a non-terminal $A$ that is repeated. The derivation
$A \Rightarrow^* sAu$ can be deleted or repeated to generate new strings in $L(G)$.

By the pigeonhole principle stated in Section 4.5, some non-terminal is repeated on $SP$.
Let $A$ be such a non-terminal. Consider the first and second time that $A$ appears on $SP$.
(See Fig. 4.31(b).) Repeat all the rules of the grammar $G$ that produced the string $y$ except
for the rule corresponding to the first instance of $A$ on $SP$ and all those rules that depend
on it. It follows that $D \Rightarrow^* aAb$ where $a$ and $b$ are in $\mathcal{T}^*$. Similarly,
apply all the rules to the derivation beginning with the first instance of $A$ on $P$ up to but not
including the rules beginning with the second instance of $A$. It follows that
$A \Rightarrow^* sAu$, where $s$ and $u$ are in $\mathcal{T}^*$ and at least one is not $\epsilon$
since no rules of the form $A \to B$ are in $G$. Finally, apply the rules starting with the second
instance of $A$ on $P$. Let $A \Rightarrow^* t$ be the yield of this set of rules. Since
$D \Rightarrow^* aAb$ and $A \Rightarrow^* t$, it follows that $L$ also contains $xatbz$. $L$ also
contains $x a s^n t u^n b z$ for $n \geq 1$ because $A \Rightarrow^* sAu$ can be applied $n$ times
after $D \Rightarrow^* aAb$ and before $A \Rightarrow^* t$. Because $stu$ is a substring of $y$,
$|stu| \leq 2^m$. Now let $r = xa$ and $v = bz$. ■

We  use  this  lemma  to  show  the  existence  of  a  language  that  is  not  context-free. 

LEMMA 4.13.2 The language $L = \{a^n b^n c^n \mid n \geq 0\}$ over the alphabet
$\Sigma = \{a, b, c\}$ is not context-free.

Proof We assume that $L$ is context-free, generated by a grammar in Chomsky normal form with $m$
non-terminals, and show that this implies $L$ contains strings not in the language. Let
$n_0 = 2^{m-1} + 1$.

Since $L$ is infinite, the pumping lemma can be applied. Let $rstuv = a^n b^n c^n$ for
$n = n_0$. From the pumping lemma $r s^2 t u^2 v$ is also in $L$. Clearly if $s$ or $u$ is not empty
(and at least one is), then each non-empty one contains either one, two, or three of the symbols in
$\Sigma$. If one of them, say $s$, contains two distinct symbols, then $s^2$ contains a $b$ before an
$a$ or a $c$ before a $b$, contradicting the definition of the language. The same is true if one of
them contains three distinct symbols. Thus, each contains at most one distinct symbol. But this
implies that the numbers of $a$'s, $b$'s, and $c$'s in $r s^2 t u^2 v$ are not all the same, whether
$s$ and $u$ contain the same or different symbols. ■
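The case analysis in this proof can be checked by brute force on a small instance. The sketch below is only an illustration under stated assumptions: the value $m = 2$ and the helper names `in_L` and `surviving_splits` are hypothetical and do not come from the text. It enumerates every split $w = rstuv$ with $|su| \geq 1$ and $|stu| \leq 2^m$ and reports those for which $r s^2 t u^2 v$ remains in $L$.

```python
def in_L(w: str) -> bool:
    """Membership test for L = { a^n b^n c^n : n >= 0 }."""
    n = len(w) // 3
    return w == "a" * n + "b" * n + "c" * n


def surviving_splits(w: str, bound: int):
    """Return every split (r, s, t, u, v) of w with |su| >= 1 and |stu| <= bound
    whose pumped string r s^2 t u^2 v is still in L."""
    found = []
    size = len(w)
    for i in range(size + 1):                      # r = w[:i]
        for j in range(i, size + 1):               # s = w[i:j]
            for k in range(j, size + 1):           # t = w[j:k]
                # u = w[k:l]; the upper limit enforces |stu| = l - i <= bound
                for l in range(k, min(size, i + bound) + 1):
                    r, s, t, u, v = w[:i], w[i:j], w[j:k], w[k:l], w[l:]
                    if s + u and in_L(r + s * 2 + t + u * 2 + v):
                        found.append((r, s, t, u, v))
    return found


m = 2                       # hypothetical number of non-terminals
n0 = 2 ** (m - 1) + 1       # n0 = 3, as in the proof
w = "a" * n0 + "b" * n0 + "c" * n0
print(surviving_splits(w, 2 ** m))   # prints []
```

An empty result means that no admissible split of this particular string survives pumping, which is what the lemma's argument predicts for every sufficiently long string in $L$.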

4.13.2  CFL  Closure  Properties 

In  Section  4.6  we  examined  the  closure  properties  of  regular  languages.  We  demonstrated  that 
they are closed under concatenation, union, Kleene closure, complementation, and intersection.
In this section we show that the context-free languages are closed under concatenation,
union,  and  Kleene  closure  but  not  complementation  or  intersection.  A  class  of  languages  is 
closed  under  an  operation  if  the  result  of  performing  the  operation  on  one  or  more  languages 
in  the  class  produces  another  language  in  the  class. 

The concatenation, union, and Kleene closure of languages are defined in Section 4.3. The
concatenation of languages $L_1$ and $L_2$, denoted $L_1 \cdot L_2$, is the language
$\{uv \mid u \in L_1 \text{ and } v \in L_2\}$. The union of languages $L_1$ and $L_2$, denoted
$L_1 \cup L_2$, is the set of strings that are in $L_1$ or $L_2$ or both. The Kleene closure of a
language $L$, denoted $L^*$ and called the Kleene star, is the language $\bigcup_{i \geq 0} L^i$
where $L^0 = \{\epsilon\}$ and $L^i = L \cdot L^{i-1}$.
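For finite languages these three operations can be computed directly. The following is a minimal sketch, assuming languages are represented as Python sets of strings; the function names are illustrative only. Since $L^*$ is infinite whenever $L$ contains a non-empty string, `star_up_to` returns only the strings of $L^*$ up to a given length.

```python
def concat(L1: set, L2: set) -> set:
    """L1 . L2 = { uv : u in L1 and v in L2 }."""
    return {u + v for u in L1 for v in L2}


def union(L1: set, L2: set) -> set:
    """Strings that are in L1 or L2 or both."""
    return L1 | L2


def power(L: set, i: int) -> set:
    """L^i, using L^0 = {epsilon} and L^i = L . L^(i-1)."""
    result = {""}
    for _ in range(i):
        result = concat(L, result)
    return result


def star_up_to(L: set, max_len: int) -> set:
    """The strings of L* whose length is at most max_len."""
    result = {""}
    frontier = {""}
    while frontier:
        # Extend the newest strings by one more factor from L; keep only new, short ones.
        frontier = {w for w in concat(L, frontier) if len(w) <= max_len} - result
        result |= frontier
    return result


print(sorted(concat({"a", "ab"}, {"b", ""})))   # ['a', 'ab', 'abb']
print(sorted(star_up_to({"a", "bb"}, 2)))       # ['', 'a', 'aa', 'bb']
```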

THEOREM 4.13.1 The context-free languages are closed under concatenation, union, and Kleene
closure.
Proof Consider two arbitrary CFLs $L(H_1)$ and $L(H_2)$ generated by grammars
$H_1 = (\mathcal{N}_1, \mathcal{T}_1, \mathcal{R}_1, S_1)$ and
$H_2 = (\mathcal{N}_2, \mathcal{T}_2, \mathcal{R}_2, S_2)$. Without loss of generality assume that
their non-terminal alphabets (and rules) are disjoint. (If not, prefix every non-terminal in the
second grammar with a symbol not used in the first. This does not change the language
generated.)

Since each string in $L(H_1) \cdot L(H_2)$ consists of a string of $L(H_1)$ followed by a string
of $L(H_2)$, it is generated by the context-free grammar
$H_3 = (\mathcal{N}_3, \mathcal{T}_3, \mathcal{R}_3, S_3)$ in which
$\mathcal{N}_3 = \mathcal{N}_1 \cup \mathcal{N}_2 \cup \{S_3\}$,
$\mathcal{T}_3 = \mathcal{T}_1 \cup \mathcal{T}_2$, and
$\mathcal{R}_3 = \mathcal{R}_1 \cup \mathcal{R}_2 \cup \{S_3 \to S_1 S_2\}$. The new rule
$S_3 \to S_1 S_2$ generates a string of $L(H_1)$ followed by a string of $L(H_2)$. Thus,
$L(H_1) \cdot L(H_2)$ is context-free.

The union of languages $L(H_1)$ and $L(H_2)$ is generated by the context-free grammar
$H_4 = (\mathcal{N}_4, \mathcal{T}_4, \mathcal{R}_4, S_4)$ in which
$\mathcal{N}_4 = \mathcal{N}_1 \cup \mathcal{N}_2 \cup \{S_4\}$,
$\mathcal{T}_4 = \mathcal{T}_1 \cup \mathcal{T}_2$, and
$\mathcal{R}_4 = \mathcal{R}_1 \cup \mathcal{R}_2 \cup \{S_4 \to S_1, S_4 \to S_2\}$. To see this,
observe that after applying $S_4 \to S_1$ all subsequent rules are drawn from $H_1$. (The sets of
non-terminals are disjoint.) A similar statement applies to the application of $S_4 \to S_2$. Since
$H_4$ is context-free, $L(H_4) = L(H_1) \cup L(H_2)$ is context-free.

The Kleene closure of $L(H_1)$, namely $L(H_1)^*$, is generated by the context-free grammar
$H_5 = (\mathcal{N}_1 \cup \{S_5\}, \mathcal{T}_1, \mathcal{R}_5, S_5)$ in which
$\mathcal{R}_5 = \mathcal{R}_1 \cup \{S_5 \to \epsilon, S_5 \to S_1 S_5\}$. (A new start symbol is
used because $S_1$ may appear on the right-hand side of rules in $\mathcal{R}_1$; adding
$S_1 \to \epsilon$ and $S_1 \to S_1 S_1$ directly to $\mathcal{R}_1$ could then generate strings
outside $L(H_1)^*$.) To see this, observe that $L(H_5)$ includes $\epsilon$ and, through $i$
applications of $S_5 \to S_1 S_5$ followed by $S_5 \to \epsilon$, every string in $L(H_1)^i$ for
$i \geq 1$. Thus, $L(H_1)^*$ is generated by $H_5$ and is context-free. ■
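The grammar constructions in this proof are mechanical and can be sketched in code. The representation below is an assumption made for illustration only: a grammar is a pair `(rules, start)`, where `rules` maps each non-terminal to a list of right-hand sides (tuples of symbols) and every symbol that is not a key of `rules` is a one-character terminal. All function names are hypothetical. For the Kleene closure the sketch uses a fresh start symbol `S5` with rules $S_5 \to \epsilon$ and $S_5 \to S_1 S_5$, a variant that stays correct even when $S_1$ occurs on a right-hand side. The enumerator `language` is a simple breadth-first search that terminates for the sample grammars here; it is not guaranteed to terminate for grammars with many $\epsilon$-rules or cycles of unit rules.

```python
from collections import deque


def rename(grammar, tag):
    """Suffix every non-terminal with `tag` so that two grammars become disjoint."""
    rules, start = grammar
    new = lambda x: x + tag if x in rules else x
    return ({new(a): [tuple(new(x) for x in rhs) for rhs in rhss]
             for a, rhss in rules.items()}, new(start))


def concat_grammar(g1, g2):
    (r1, s1), (r2, s2) = rename(g1, "_1"), rename(g2, "_2")
    return ({**r1, **r2, "S3": [(s1, s2)]}, "S3")         # S3 -> S1 S2


def union_grammar(g1, g2):
    (r1, s1), (r2, s2) = rename(g1, "_1"), rename(g2, "_2")
    return ({**r1, **r2, "S4": [(s1,), (s2,)]}, "S4")     # S4 -> S1 | S2


def star_grammar(g1):
    r1, s1 = rename(g1, "_1")
    return ({**r1, "S5": [(), (s1, "S5")]}, "S5")         # S5 -> epsilon | S1 S5


def min_lengths(rules):
    """Length of the shortest terminal string derivable from each non-terminal."""
    best = {a: float("inf") for a in rules}
    changed = True
    while changed:
        changed = False
        for a, rhss in rules.items():
            for rhs in rhss:
                cost = sum(best.get(x, 1) for x in rhs)   # a terminal costs 1
                if cost < best[a]:
                    best[a], changed = cost, True
    return best


def language(grammar, max_len):
    """Strings of length <= max_len generated by the grammar (leftmost derivations)."""
    rules, start = grammar
    best = min_lengths(rules)
    seen, queue, words = {(start,)}, deque([(start,)]), set()
    while queue:
        form = queue.popleft()
        idx = next((i for i, x in enumerate(form) if x in rules), None)
        if idx is None:                                   # no non-terminal left
            words.add("".join(form))
            continue
        for rhs in rules[form[idx]]:
            nxt = form[:idx] + rhs + form[idx + 1:]
            # Prune forms that can no longer yield a string of length <= max_len.
            if nxt not in seen and sum(best.get(x, 1) for x in nxt) <= max_len:
                seen.add(nxt)
                queue.append(nxt)
    return words


g1 = ({"S": [("a", "S", "b"), ("a", "b")]}, "S")   # { a^n b^n : n >= 1 }
g2 = ({"S": [("c", "S"), ("c",)]}, "S")            # { c^n : n >= 1 }
print(sorted(language(concat_grammar(g1, g2), 4)))  # ['abc', 'abcc']
print(sorted(language(star_grammar(g1), 4)))        # ['', 'aabb', 'ab', 'abab']
```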

We now use this result and Lemma 4.13.2 to show that the set of context-free languages
is not closed under complementation and intersection, operations defined in Section 4.6. The
complement of a language $L$ over an alphabet $\Sigma$, denoted $\overline{L}$, is the set of
strings in $\Sigma^*$ that are not in $L$. The intersection of two languages $L_1$ and $L_2$,
denoted $L_1 \cap L_2$, is the set of strings that are in both languages.

THEOREM 4.13.2 The set of context-free languages is not closed under complementation or
intersection.

Proof The intersection of two languages $L_1$ and $L_2$ can be defined in terms of the
complement and union operations as follows:

$$L_1 \cap L_2 = \Sigma^* - \bigl((\Sigma^* - L_1) \cup (\Sigma^* - L_2)\bigr)$$

Thus, since the union of two CFLs is a CFL, if the complement of a CFL were always a CFL, then
by this identity the intersection of two CFLs would also be a CFL. We now show that the
intersection of two CFLs is not always a CFL.

The language $L_1 = \{a^n b^n c^m \mid n, m \geq 0\}$ is generated by the grammar
$H_1 = (\mathcal{N}_1, \mathcal{T}_1, \mathcal{R}_1, S)$, where $\mathcal{N}_1 = \{S, A, B\}$,
$\mathcal{T}_1 = \{a, b, c\}$, and the rules $\mathcal{R}_1$ are:

a) $S \to AB$          d) $B \to Bc$

b) $A \to aAb$         e) $B \to \epsilon$

c) $A \to \epsilon$

The language $L_2 = \{a^m b^n c^n \mid n, m \geq 0\}$ is generated by the grammar
$H_2 = (\mathcal{N}_2, \mathcal{T}_2, \mathcal{R}_2, S)$, where $\mathcal{N}_2 = \{S, A, B\}$,
$\mathcal{T}_2 = \{a, b, c\}$, and the rules $\mathcal{R}_2$ are:

a) $S \to AB$          d) $B \to bBc$

b) $A \to aA$          e) $B \to \epsilon$

c) $A \to \epsilon$

Thus, the languages $L_1$ and $L_2$ are context-free. However, their intersection is
$L_1 \cap L_2 = \{a^n b^n c^n \mid n \geq 0\}$, which was shown in Lemma 4.13.2 not to be
context-free. Thus, the set of CFLs is not closed under intersection and, by the identity above,
it is not closed under complementation either. ■
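The two set-theoretic facts used in this proof can be checked on a finite universe. The sketch below is an illustration only: it restricts attention to all strings over $\{a, b, c\}$ of length at most 6, and the membership predicates and names are assumptions introduced here. It confirms on that universe that $L_1 \cap L_2$ is exactly $\{a^n b^n c^n\}$ and that the complement-and-union identity yields the same set.

```python
import re
from itertools import product


def blocks(w):
    """Split w into its runs of a's, b's, c's, or return None if w is not in a*b*c*."""
    m = re.fullmatch(r"(a*)(b*)(c*)", w)
    return m.groups() if m else None


def in_L1(w):   # { a^n b^n c^m : n, m >= 0 }
    g = blocks(w)
    return g is not None and len(g[0]) == len(g[1])


def in_L2(w):   # { a^m b^n c^n : n, m >= 0 }
    g = blocks(w)
    return g is not None and len(g[1]) == len(g[2])


def in_L(w):    # { a^n b^n c^n : n >= 0 }
    n = len(w) // 3
    return w == "a" * n + "b" * n + "c" * n


# Finite stand-in for Sigma^*: every string of length at most 6.
universe = {"".join(t) for k in range(7) for t in product("abc", repeat=k)}
L1 = {w for w in universe if in_L1(w)}
L2 = {w for w in universe if in_L2(w)}

common = L1 & L2
via_identity = universe - ((universe - L1) | (universe - L2))
print(sorted(common))   # ['', 'aabbcc', 'abc']
```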


Problems 

FSM  MODELS 

4.1 Let $M = (\Sigma, \Psi, Q, \delta, \lambda, s, F)$ be the FSM model described in Definition
3.1.1. It differs from the FSM model of Section 4.1 in that its output alphabet $\Psi$ has been
explicitly identified. Let this machine recognize the language $L(M)$ consisting of input
strings $w$ that cause the last output produced by $M$ to be the first letter in $\Psi$. Show
that every language recognized under this definition is a language recognized according
to the "final-state definition" in Definition 4.1.1 and vice versa.

4.2 The Mealy machine is a seven-tuple $M = (\Sigma, \Psi, Q, \delta, \lambda, s, F)$ identical in
its definition with the Moore machine of Definition 3.1.1 except that its output function
$\lambda : Q \times \Sigma \mapsto \Psi$ depends on both the current state and input letter,
whereas the output function $\lambda : Q \mapsto \Psi$ of the Moore FSM depends only on the current
state. Show that the two machines recognize the same languages and compute the same functions with
the exception of $\epsilon$.

4.3 Suppose that an FSM is allowed to make state $\epsilon$-transitions, that is, state transitions
on the empty string. Show that the new machine model is no more powerful than the
Moore  machine  model. 

Hint: Show how $\epsilon$-transitions can be removed, perhaps by making the resultant FSM
nondeterministic. 

EQUIVALENCE  OF  DFSMS  AND  NFSMS 

4.4  Functions  computed  by  FSMs  are  described  in  Definition  3.1.1.  Can  a  consistent 
definition  of  function  computation  by  NFSMs  be  given?  If  not,  why  not? 

4.5  Construct  a  deterministic  FSM  equivalent  to  the  nondeterministic  FSM  shown  in 
Fig.  4.32. 

REGULAR  EXPRESSIONS 

4.6 Show that the regular expression $0(0^*10^*)^+$ defines strings starting with 0 and
containing at least one 1.

4.7 Show that the regular expressions $0^*$, $0(0^*10^*)^+$, and $1(0 + 1)^*$ partition the set of
all strings over 0 and 1.

Figure 4.32 A nondeterministic FSM.

4.8 Give regular expressions generating the following languages over $\Sigma = \{0, 1\}$:


a) $L = \{w \mid w$ has length at least 3 and its third symbol is a 0$\}$

b) $L = \{w \mid w$ begins with a 1 and ends with a 0$\}$

c) $L = \{w \mid w$ contains at least three 1s$\}$

4.9 Give regular expressions generating the following languages over $\Sigma = \{0, 1\}$:

a) $L = \{w \mid w$ is any string except 11 and 111$\}$

b) $L = \{w \mid$ every odd position of $w$ is a 1$\}$

4.10  Give  regular  expressions  for  the  languages  over  the  alphabet  {0,  1,  2,  3,  4,  5,  6,  7,  8, 
9}  describing  positive  integers  that  are: 

a)  even 

b)  odd 

c)  a  multiple  of  5 

d)  a  multiple  of  4 

4.11 Give proofs for the rules stated in Theorem 4.3.1.

4.12 Show that $\epsilon + 01 + (010)(10 + 010)^*(\epsilon + 1 + 01)$ and $(01 + 010)^*$ describe
the same language.

REGULAR  EXPRESSIONS  AND  FSMS 

4.13 a) Find a simple nondeterministic finite-state machine accepting the language
$(01 \cup 001 \cup 010)^*$ over $\Sigma = \{0, 1\}$.

b)  Convert  the  nondeterministic  finite  state  machine  of  part  (a)  to  a  deterministic 
finite-state  machine  by  the  method  of  Section  4.2. 

4.14 a) Let $\Sigma = \{0, 1, 2\}$, and let $L$ be the language over $\Sigma$ that contains each
string $w$ ending with some symbol that does not occur anywhere else in $w$. For example,
011012, 20021, 11120, 0002, 10, and 1 are all strings in $L$. Construct a
nondeterministic finite-state machine that accepts $L$.




b)  Convert  the  nondeterministic  finite-state  machine  of  part  (a)  to  a  deterministic 
finite-state  machine  by  the  method  of  Section  4.2. 

4.15  Describe  an  algorithm  to  convert  a  regular  expression  to  an  NFSM  using  the  proof  of 
Theorem  4.4.1. 

4.16  Design  DFSMs  that  recognize  the  following  languages: 

a)  a*bca* 

b)  (a  +  c)*  (ab  +  ca)b* 

c) (a*b*(b + c)*)*

4.17 Design an FSM that recognizes decimal strings (over the alphabet {0, 1, 2, 3, 4, 5, 6,
7, 8, 9}) representing the integers whose value is 0 modulo 3.

Hint: Use the fact that $(10)^k = 1 \bmod 3$ (where 10 is "ten") to show that
$(a_k (10)^k + a_{k-1} (10)^{k-1} + \cdots + a_1 (10)^1 + a_0) \bmod 3 =
(a_k + a_{k-1} + \cdots + a_1 + a_0) \bmod 3$.
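The identity in the hint can be checked numerically before it is proved. The short sketch below is an illustration only, and the function name is hypothetical: it accumulates the digit sum modulo 3 one digit at a time, so only three values ever need to be remembered, and compares the result with the residue of the integer itself.

```python
def digit_sum_mod3(decimal: str) -> int:
    """Residue modulo 3 of the digit sum, updated one digit at a time."""
    state = 0                        # takes only the values 0, 1, 2
    for ch in decimal:
        state = (state + int(ch)) % 3
    return state


# The digit-sum residue agrees with the value of the integer modulo 3.
print(all(digit_sum_mod3(str(n)) == n % 3 for n in range(10000)))   # True
```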

4.18  Use  the  above  FSM  design  to  generate  a  regular  expression  describing  those  integers 
whose  value  is  0  modulo  3. 

4.19 Describe an algorithm that constructs an NFSM from a regular expression $r$ and accepts
a string $w$ if $w$ contains a string denoted by $r$ that begins anywhere in $w$.

THE  PUMPING  LEMMA 

4.20  Show  that  the  following  languages  are  not  regular: 

a) $L = \{a^n b a^n \mid n \geq 0\}$

b) $L = \{0^n 1^{2n} 0^n \mid n \geq 1\}$

c) $L = \{a^n b^n c^n \mid n \geq 0\}$

4.21 Strengthen the pumping lemma for regular languages by demonstrating that if $L$ is
a regular language over the alphabet $\Sigma$ recognized by a DFSM with $m$ states and it
contains a string $w$ of length $m$ or more, then any substring $z$ of $w$ ($w = uzv$) of
length $m$ can be written as $z = rst$, where $|s| \geq 1$ such that for all integers $n \geq 0$,
$u r s^n t v \in L$. Explain why this pumping lemma is stronger than the one stated in
Lemma 4.5.1.

4.22 Show that the language $L = \{a^i b^j \mid i > j\}$ is not regular.

4.23 Show that the following language is not regular:

a) $\{u^n z v^m z w^{n+m} \mid n, m \geq 1\}$

PROPERTIES  OF  REGULAR  LANGUAGES 

4.24  Use  Lemma  4.5.1  and  the  closure  property  of  regular  languages  under  intersection  to 
show  that  the  following  languages  are  not  regular: 

a) $\{w w^R \mid w \in \{0, 1\}^*\}$

b) $\{w \overline{w} \mid$ where $\overline{w}$ denotes $w$ in which 0's and 1's are interchanged$\}$

c) $\{w \mid w$ has equal number of 0's and 1's$\}$

4.25  Prove  or  disprove  each  of  the  following  statements: 
a)  Every  subset  of  a  regular  language  is  regular 
b) Every regular language has a proper subset that is also a regular language

c) If $L$ is regular, then so is $\{xy \mid x \in L$ and $y \notin L\}$

d) If $L$ is a regular language, then so is $\{w : w \in L$ and $w^R \in L\}$

e) $\{w \mid w = w^R\}$ is regular

STATE  MINIMIZATION 

4.26  Find  a  minimal-state  FSM  equivalent  to  that  shown  in  Fig.  4.33. 

4.27 Show that the languages recognized by $M$ and $M_\equiv$ are the same, where $\equiv$ is the
equivalence relation on $M$ defined by states that are indistinguishable by input strings of any
length.

4.28 Show that the equivalence relation $R_L$ is right-invariant.

4.29 Show that the equivalence relation $R_M$ is right-invariant.

4.30 Show that the right-invariance equivalence relation (defined in Definition 4.7.2) for the
language $L = \{a^n b^n \mid n \geq 0\}$ has an unbounded number of equivalence classes.

4.31 Show that the DFSM in Fig. 4.20 is the machine $M_L$ associated with the language
$L = (10^*1 + 0)^*$.

PUSHDOWN  AUTOMATA 

4.32 Construct a pushdown automaton that accepts the following language: $L = \{w \mid w$ is
a string over the alphabet $\Sigma = \{(, )\}$ of balanced parentheses$\}$.

4.33 Construct a pushdown automaton that accepts the following language: $L = \{w \mid w$
contains more 1's than 0's$\}$.


Figure  4.33  A  four-state  finite-state  machine. 
PHRASE STRUCTURE LANGUAGES

4.34  Give  phrase-structure  grammars  for  the  following  languages: 

a) $\{ww \mid w \in \{a, b\}^*\}$

b) $\{0^{2^i} \mid i \geq 1\}$

4.35 Show that the following language can be described by a phrase-structure grammar:

$\{a^i \mid i$ is not prime$\}$


CONTEXT-SENSITIVE  LANGUAGES 

4.36 Show that every context-sensitive language can be accepted by a linear bounded
automaton (LBA), a nondeterministic Turing machine in which the tape head visits a
number of cells that is a constant multiple of the number of characters in the input
string $w$.

Hint:  Consider  a  construction  similar  to  that  used  in  the  proof  of  Theorem  5.4.2. 
Instead  of  using  a  second  tape,  use  a  second  track  on  the  tape  of  the  TM. 

4.37 Show that every language accepted by a linear bounded automaton can be generated by
a context-sensitive grammar.

Hint:  Consider  a  construction  similar  to  that  used  in  the  proof  of  Theorem  5.4.1  but 
instead  of  deleting  characters  at  the  end  of  TM  configuration,  encode  the  end  markers 
[  and  ]  by  enlarging  the  tape  alphabet  of  the  LBA  to  permit  the  first  and  last  characters 
to  be  either  marked  or  unmarked. 

4.38 Show that the grammar $G_1$ in Example 4.9.1 is context-sensitive and generates the
language $L(G_1) = \{a^n b^n c^n \mid n \geq 1\}$.

4.39 Show that the language $\{0^{2^i} \mid i \geq 1\}$ is context-sensitive.

4.40  Show  that  the  context-sensitive  languages  are  closed  under  union,  intersection,  and 
concatenation. 

CONTEXT-FREE  LANGUAGES 

4.41 Show that the language generated by the context-free grammar $G_3$ of Example 4.9.3 is
$L(G_3) = \{c a^n c a^n c b^m c b^m c \mid n, m \geq 0\}$.

4.42 Construct context-free grammars for each of the following languages:

a) $\{w w^R \mid w \in \{a, b\}^*\}$

b) $\{w \mid w \in \{a, b\}^*, w = w^R\}$

c) $L = \{w \mid w$ has twice as many 0's as 1's$\}$

4.43 Give context-free grammars for each of the following languages:

a) $\{w \in \{a, b\}^* \mid w$ has twice as many $a$'s as $b$'s$\}$

b) $\{a^r b^s \mid r \leq s \leq 2r\}$
REGULAR LANGUAGES

4.44 Show that the language generated by the regular grammar $G_4$ described in Example 4.9.4 is
$L(G_4) = (01)^*0$.

4.45 Show that grammar $G = (\mathcal{N}, \mathcal{T}, \mathcal{R}, S)$, where
$\mathcal{N} = \{A, B, S\}$, $\mathcal{T} = \{a, b\}$ and the rules $\mathcal{R}$ are given below,
is regular.

a) $S \to abA$      d) $S \to \epsilon$      f) $B \to aS$

b) $S \to baB$      e) $A \to bS$            g) $A \to b$

c) $S \to B$

Give  a  derivation  for  the  string  abbbaa. 

4.46  Provide  a  regular  grammar  generating  strings  over  {0,  1}  not  containing  00. 

4.47 Give a regular grammar for each of the following languages and show that there is an
FSM that accepts it. In all cases $\Sigma = \{0, 1\}$.

a) $L = \{w \mid$ the length of $w$ is odd$\}$

b) $L = \{w \mid w$ contains at least three 1s$\}$

REGULAR  LANGUAGE  RECOGNITION 

4.48 Construct a finite-state machine that recognizes the language generated by the grammar
$G = (\mathcal{N}, \mathcal{T}, \mathcal{R}, S)$, where $\mathcal{N} = \{S, X, Y\}$,
$\mathcal{T} = \{x, y\}$, and $\mathcal{R}$ contains the following rules: $S \to xX$, $S \to yY$,
$X \to yY$, $Y \to xX$, $X \to \epsilon$, and $Y \to \epsilon$.

4.49 Describe finite-state machines that recognize the following languages:

a) $\{w \in \{a, b\}^* \mid w$ has an odd number of $a$'s$\}$

b) $\{w \in \{a, b\}^* \mid w$ has $ab$ and $ba$ as substrings$\}$

4.50  Show  that,  if  L  is  a  regular  language,  then  the  language  obtained  by  reversing  the  letters 
in  each  string  in  L  is  also  regular. 

4.51 Show that, if $L$ is a regular language, then the language consisting of strings in $L$ whose
reversals  are  also  in  L  is  regular. 

PARSING  CONTEXT-FREE  LANGUAGES 

4.52 Use the algorithm of Theorem 4.11.2 to construct a parse tree for the string (a * b +
a) * (a + b) generated by the grammar $G_5$ of Example 4.11.2, and give a leftmost and
a rightmost derivation for the string.

4.53 Let $G = (\mathcal{N}, \mathcal{T}, \mathcal{R}, S)$ be the context-free grammar with
$\mathcal{N} = \{S\}$ and $\mathcal{T} = \{(, ), 0\}$ with rules
$\mathcal{R} = \{S \to 0, S \to SS, S \to (S)\}$. Use the algorithm of Theorem 4.11.2 to
generate a parse tree for the string (0)((0)).

CFL  ACCEPTANCE  WITH  PUSHDOWN  AUTOMATA 

4.54  Construct  PDAs  that  accept  each  of  the  following  languages: 

a) $\{a^n b^n \mid n \geq 0\}$

b) $\{w w^R \mid w \in \{a, b\}^*\}$

c) $\{w \mid w \in \{a, b\}^*, w = w^R\}$

4.55 Construct PDAs that accept each of the following languages:

a) $\{w \in \{a, b\}^* \mid w$ has twice as many $a$'s as $b$'s$\}$

b) $\{a^r b^s \mid r \leq s \leq 2r\}$

4.56 Use the algorithm of Theorem 4.12.2 to construct a context-free grammar that generates
the language accepted by the PDA in Example 4.8.2.

4.57 Construct a context-free grammar for the language $\{w c w^R \mid w \in \{a, b\}^*\}$.

Hint: Use the algorithm of Theorem 4.12.2 to construct a context-free grammar that
generates the language accepted by the PDA in Example 4.8.1.

PROPERTIES  OF  CONTEXT-FREE  LANGUAGES 

4.58 Show that the intersection of a context-free language and a regular language is
context-free.

Hint:  From  machines  accepting  the  two  language  types,  construct  a  machine  accepting 
their  intersection. 

4.59 Suppose that $L$ is a context-free language and $R$ is a regular one. Is $L - R$ necessarily
context-free? What about $R - L$? Justify your answers.

4.60 Show that, if $L$ is context-free, then so is $L^R = \{w^R \mid w \in L\}$.

4.61 Let $G = (\mathcal{N}, \mathcal{T}, \mathcal{R}, S)$ be context-free. A non-terminal $A$ is
self-embedding if and only if $A \Rightarrow^* sAu$ for some $s, u \in \mathcal{T}$.

a) Give a procedure to determine whether $A \in \mathcal{N}$ is self-embedding.

b)  Show  that,  if  G  does  not  have  a  self-embedding  non-terminal,  then  it  is  regular. 

CFL  PUMPING  LEMMA 

4.62  Show  that  the  following  languages  are  not  context-free: 

a) $\{0^{2^i} \mid i \geq 1\}$

b)  { |  n  >  1 } 

c) $\{0^n \mid n$ is a prime$\}$

4.63 Show that the following languages are not context-free:

a) $\{0^n 1^n 0^n 1^n \mid n \geq 0\}$

b) $\{a^i b^j c^k \mid 0 \leq i \leq j \leq k\}$

c) $\{ww \mid w \in \{0, 1\}^*\}$

4.64 Show that the language $\{ww \mid w \in \{a, b\}^*\}$ is not context-free.

CFL  CLOSURE  PROPERTIES 

4.65 Let $M_1$ and $M_2$ be pushdown automata accepting the languages $L(M_1)$ and $L(M_2)$.
Describe PDAs accepting their union $L(M_1) \cup L(M_2)$, concatenation $L(M_1) \cdot L(M_2)$,
and Kleene closure $L(M_1)^*$, thereby giving an alternate proof of Theorem 4.13.1.

4.66 Use closure under concatenation of context-free languages to show that the language
$\{w w^R v^R v \mid w, v \in \{a, b\}^*\}$ is context-free.
Chapter Notes

The concept of the finite-state machine is often attributed to McCulloch and Pitts [211].
The models studied today are due to Moore [223] and Mealy [215]. The equivalence of
deterministic and non-deterministic FSMs (Theorem 4.4.1) was established by Rabin and
Scott [266].

Kleene established the equivalence of regular expressions and finite-state machines. The
proof used in Theorems 4.4.1 and 4.4.2 is due to McNaughton and Yamada [212]. The
pumping lemma (Lemma 4.5.1) is due to Bar-Hillel, Perles, and Shamir [28]. The closure
properties of regular expressions are due to McNaughton and Yamada [212].

State minimization was studied by Huffman [144] and Moore [223]. The Myhill-Nerode
Theorem was independently obtained by Myhill [227] and Nerode [229]. Hopcroft [139] has
given an efficient algorithm for state minimization.

Chomsky [68,69] defined four classes of formal language, the regular, context-free,
context-sensitive, and phrase-structure languages. He and Miller [71] demonstrated the equivalence
of languages generated by regular grammars and those recognized by finite-state machines.
Chomsky introduced the normal form that carries his name [69]. Oettinger [233] introduced
the pushdown automaton and Schutzenberger [305], Chomsky [70], and Evey [97] independently
demonstrated the equivalence of context-free languages and pushdown automata.

Two  efficient  algorithms  for  parsing  context-free  languages  were  developed  by  Earley  [94] 
and  Cocke  (unpublished)  and  independently  by  Kasami  [162]  and  Younger  [371].  These  are 
cubic-time  algorithms.  Our  formulation  of  the  parsing  algorithm  of  Section  4.11  is  based 
on  Valiant’s  derivation  [342]  of  the  Cocke-Kasami-Younger  recognition  matrix,  where  he  also 
presents the fastest known general algorithm to parse context-free languages. The CFL pumping
lemma and the closure properties of CFLs are due to Bar-Hillel, Perles, and Shamir [28].

Myhill  [228]  introduced  the  deterministic  linear-bounded  automata  and  Landweber  [189] 
showed  that  languages  accepted  by  linear-bounded  automata  are  context-sensitive.  Kuroda 
[184]  generalized  the  linear-bounded  automata  to  be  nondeterministic  and  established  the 
equivalence  of  such  machines  and  the  context-sensitive  languages. 
The Turing machine (TM) is believed to be the most general computational model that can
be  devised  (the  Church-Turing  thesis).  Despite  many  attempts,  no  computational  model  has 
yet  been  introduced  that  can  perform  computations  impossible  on  a  Turing  machine.  This 
is  not  a  statement  about  efficiency;  other  machines,  notably  the  RAM  of  Section  3.4,  can  do 
the  same  computations  either  more  quickly  or  with  less  memory.  Instead,  it  is  a  statement 
about  the  feasibility  of  computational  tasks.  If  a  task  can  be  done  on  a  Turing  machine,  it  is 
considered  feasible;  if  it  cannot,  it  is  considered  infeasible.  Thus,  the  TM  is  a  litmus  test  for 
computational  feasibility.  As  we  show  later,  however,  there  are  some  well-defined  tasks  that 
cannot  be  done  on  a  TM. 

The  chapter  opens  with  a  formal  definition  of  the  standard  Turing  machine  and  describes 
how  the  Turing  machine  can  be  used  to  compute  functions  and  accept  languages.  We  then 
examine  multi-tape  and  nondeterministic  TMs  and  show  their  equivalence  to  the  standard 
model.  The  nondeterministic  TM  plays  an  important  role  in  Chapter  8  in  the  classification  of 
languages  by  their  complexity.  The  equivalence  of  phrase-structure  languages  and  the  languages 
accepted  by  TMs  is  then  established.  The  universal  Turing  machine  is  defined  and  used  to 
explore  limits  on  language  acceptance  by  Turing  machines.  We  show  that  some  languages 
cannot  be  accepted  by  any  Turing  machine,  while  others  can  be  accepted  but  not  by  Turing 
machines  that  halt  on  all  inputs  (the  languages  are  unsolvable).  This  sets  the  stage  for  a  proof 
that  some  problems,  such  as  the  Halting  Problem,  are  unsolvable;  that  is,  there  is  no  Turing 
machine  halting  on  all  inputs  that  can  decide  for  an  arbitrary  Turing  machine  M  and  input 
string  w  whether  or  not  M  will  halt  on  w.  We  close  by  defining  the  partial  recursive  functions, 
the  most  general  functions  computable  by  Turing  machines. 


5.1  The  Standard  Turing  Machine  Model 

The  standard  Turing  machine  consists  of  a  control  unit,  which  is  a  finite-state  machine,  and 
a (single-ended) infinite-capacity tape unit. (See Fig. 5.1.) Each cell of the tape unit initially
contains the blank symbol $\beta$. A string of symbols from the tape alphabet $\Gamma$ is written
left-adjusted on the tape and the tape head is placed over the first cell. The control unit then reads
the  symbol  under  the  head  and  makes  a  state  transition  the  result  of  which  is  either  to  write 
a  new  symbol  under  the  tape  head  or  to  move  the  head  left  (if  possible)  or  right.  (The  TM 
described  in  Section  3.7  is  slightly  different;  it  always  replaces  the  cell  contents  and  always 
issues  a  move  command,  even  if  the  effect  in  both  cases  is  null.  The  equivalence  between  the 
standard  TM  and  that  described  in  Section  3.7  is  easily  established.  See  Problem  5.1.)  A  move 
left  from  the  first  cell  leads  to  abnormal  termination,  a  problem  that  can  be  avoided  by  having 
the  Turing  machine  write  a  special  end-of-tape  marker  in  the  first  tape  cell.  This  marker  is  a 
tape  symbol  not  used  elsewhere. 

DEFINITION 5.1.1 A standard Turing machine (TM) is a six-tuple $M = (\Gamma, \beta, Q, \delta, s, h)$
where $\Gamma$ is the tape alphabet not containing the blank symbol $\beta$, $Q$ is the finite set of states,
$\delta : Q \times (\Gamma \cup \{\beta\}) \mapsto (Q \cup \{h\}) \times (\Gamma \cup \{\beta\} \cup \{L, R\})$ is the next-state function, $s$ is the
initial state, and $h \notin Q$ is the accepting halt state. A TM cannot exit from $h$. If $M$ is in state
$q$ with letter $a$ under the tape head and $\delta(q, a) = (q', C)$, its control unit enters state $q'$ and writes
$a'$ if $C = a' \in \Gamma \cup \{\beta\}$ or moves the head left (if possible) or right if $C$ is $L$ or $R$, respectively.

The TM $M$ accepts the input string $w \in \Gamma^*$ (it contains no blanks) if when started in state
s  with  w  placed  left-adjusted  on  its  otherwise  blank  tape  and  the  tape  head  at  the  leftmost  tape  cell, 
the  last  state  entered  by  M  is  h.  M  accepts  the  language  L(M)  consisting  of  all  strings  accepted 
by  M.  Languages  accepted  by  Turing  machines  are  called  recursively  enumerable.  A  language 
L  is  decidable  or  recursive  if  there  exists  a  TM  M  that  halts  on  every  input  string,  whether  in  L 
or  not,  and  accepts  exactly  the  strings  in  L. 

A function $f : \Gamma^* \mapsto \Gamma^* \cup \{\bot\}$, where $\bot$ is a symbol that is not in $\Gamma$, is partial if for some
$w \in \Gamma^*$, $f(w) = \bot$ ($f$ is not defined on $w$). Otherwise, $f$ is total.

A TM $M$ computes a function $f : \Gamma^* \mapsto \Gamma^* \cup \{\bot\}$ for those $w$ such that $f(w)$ is defined if
when  started  in  state  s  with  w  placed  left-adjusted  on  its  otherwise  blank  tape  and  the  tape  head 
at  the  leftmost  tape  cell,  M  enters  the  accepting  halt  state  h  with  f(w)  written  left-adjusted  on  its 
otherwise  blank  tape.  If  a  TM  halts  on  all  inputs,  it  implements  an  algorithm.  A  task  defined  by 
a  total  function  f  is  solvable  if  f  has  an  algorithm  and  unsolvable  otherwise. 




Figure  5. 1  The  control  and  tape  units  of  the  standard  Turing  machine. 


©John  E  Savage 






Figure  5.2  An  accepter  (a)  for  a  language  L  is  a  Turing  machine  that  can  accept  strings  in  a 
language  L  but  may  not  halt  on  all  inputs.  A  decider  or  recognizer  (b)  for  a  language  L  is  a  Turing 
machine  that  halts  on  all  inputs  and  accepts  strings  in  L. 


The accepting halt state $h$ has been singled out to emphasize language acceptance. However,
there is nothing to prevent a TM from having multiple halt states, states from which it
does  not  exit.  (A  halt  state  can  be  realized  by  a  state  to  which  a  TM  returns  on  every  input 
without  moving  the  tape  head  or  changing  the  value  under  the  head.)  On  the  other  hand,  on 
some  inputs  a  TM  may  never  halt.  For  example,  it  may  endlessly  move  its  tape  head  right  one 
cell  and  write  the  symbol  a. 

Notice  that  we  do  not  require  a  TM  M  to  halt  on  every  input  string  for  it  to  accept  a 
language  L(M).  It  need  only  halt  on  those  strings  in  the  language.  A  language  L  for  which 
there  is  a  TM  M  accepting  L  =  L(M)  that  halts  on  all  inputs  is  decidable.  The  distinction 
between  accepting  and  recognizing  (or  deciding)  a  language  L  is  illustrated  schematically  in 
Fig.  5.2.  An  accepter  is  a  TM  that  accepts  strings  in  L  but  may  not  halt  on  strings  not  in  L. 
When  the  accepter  determines  that  the  string  w  is  in  the  language  L,  it  turns  on  the  “Yes” 
light.  If  this  light  is  not  turned  on,  it  may  be  that  the  string  is  not  in  L  or  that  the  TM  is  just 
slow.  On  the  other  hand,  a  recognizer  or  decider  is  a  TM  that  halts  on  all  inputs  and  accepts 
strings  in  L.  The  “Yes”  or  “No”  light  is  guaranteed  to  be  turned  on  at  some  time. 

The computing power of the TM is extended by allowing partial computations,
computations on which the TM does not halt on every input. The computation of functions by
Turing machines is discussed in Section 5.9.

5.1.1  Programming  the  Turing  Machine 

Programming  a  Turing  machine  means  choosing  a  tape  alphabet  and  designing  its  control 
unit,  a  finite-state  machine.  Since  the  FSM  has  been  extensively  studied  elsewhere,  we  limit 
our  discussion  of  programming  of  Turing  machines  to  four  examples,  each  of  which  illustrates 
a  fundamental  point  about  Turing  machines.  Although  TMs  are  generally  designed  to  perform 
unbounded  computations,  their  control  units  have  a  bounded  number  of  states.  Thus,  we  must 
ensure that as they move across their tapes they do not accumulate an unbounded amount of
information. 

A simple example of a TM is one that moves right until it encounters a blank, whereupon
it halts. The TM of Fig. 5.3(a) performs this task. If the symbol under the head is 0 or 1,
it moves right. If it is the blank symbol, it halts. This TM can be extended to replace the
rightmost character in a string of non-blank characters with a blank. After finding the blank
on the right of a non-blank string, it backs up one cell and replaces the character with a blank.
Both TMs compute functions that map strings to strings.

(a)
q     σ     δ(q, σ)
q1    0     q1  R
q1    1     q1  R
q1    β     h   β

(b)
q     σ     δ(q, σ)
q1    0     q2  β
q1    1     q3  β
q1    β     h   β
q2    0     q4  R
q2    1     q4  R
q2    β     q4  R
q3    0     q5  R
q3    1     q5  R
q3    β     q5  R
q4    0     q2  0
q4    1     q3  0
q4    β     h   0
q5    0     q2  1
q5    1     q3  1
q5    β     h   1

Figure 5.3 The transition functions of two Turing machines, one (a) that moves across the
non-blank symbols on its tape and halts over the first blank symbol, and a second (b) that moves
the input string right one position and inserts a blank to its left.

A second example is a TM that replaces the first letter in its input string with a blank and
shifts the remaining letters right one position. (See Fig. 5.3(b).) In its initial state $q_1$ this TM,
which is assumed to be given a non-blank input string, records the symbol under the tape head
by entering $q_2$ if the letter is 0 or $q_3$ if the letter is 1 and writing the blank symbol. In its
current state it moves right and enters a corresponding state. (It enters $q_4$ if its current state
is $q_2$ and $q_5$ if it is $q_3$.) In the new state it prints the letter originally in the cell to its left and
enters either $q_2$ or $q_3$ depending on whether the current cell contains 0 or 1. This TM can
be used to insert a special end-of-tape marker instead of a blank to the left of a string written
initially on a tape. This idea can be generalized to insert a symbol anyplace in another string.
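
The two machines of Fig. 5.3 can be traced with a small simulator. The following is a minimal Python sketch, not part of the original text: the names `run_tm`, `FIG_A`, `FIG_B`, and the step budget `max_steps` are assumptions of this illustration. It follows Definition 5.1.1 in that each transition either writes a symbol or moves the head.

```python
BLANK = "β"  # blank symbol; the constant name is an assumption of this sketch


def run_tm(delta, start, tape_input, halt="h", max_steps=10_000):
    """Run a standard TM given as a dict (state, symbol) -> (state, action).

    An action is either a symbol to write or "L"/"R" to move the head.
    Returns (final_state, tape_contents).
    """
    tape = list(tape_input) or [BLANK]
    head, state, steps = 0, start, 0
    while state != halt:
        if steps == max_steps:
            raise RuntimeError("step budget exhausted; the machine may not halt")
        state, action = delta[(state, tape[head])]
        if action == "R":
            head += 1
            if head == len(tape):      # extend the tape with a blank on demand
                tape.append(BLANK)
        elif action == "L":
            if head == 0:
                raise RuntimeError("abnormal termination: left move from first cell")
            head -= 1
        else:
            tape[head] = action        # write a symbol; the head stays put
        steps += 1
    return state, "".join(tape)


# Fig. 5.3(a): move right across non-blank symbols, halt on the first blank.
FIG_A = {
    ("q1", "0"): ("q1", "R"),
    ("q1", "1"): ("q1", "R"),
    ("q1", BLANK): ("h", BLANK),
}

# Fig. 5.3(b): shift the input right one cell and leave a blank on its left.
FIG_B = {
    ("q1", "0"): ("q2", BLANK), ("q1", "1"): ("q3", BLANK), ("q1", BLANK): ("h", BLANK),
    ("q2", "0"): ("q4", "R"), ("q2", "1"): ("q4", "R"), ("q2", BLANK): ("q4", "R"),
    ("q3", "0"): ("q5", "R"), ("q3", "1"): ("q5", "R"), ("q3", BLANK): ("q5", "R"),
    ("q4", "0"): ("q2", "0"), ("q4", "1"): ("q3", "0"), ("q4", BLANK): ("h", "0"),
    ("q5", "0"): ("q2", "1"), ("q5", "1"): ("q3", "1"), ("q5", BLANK): ("h", "1"),
}

print(run_tm(FIG_A, "q1", "0110"))  # ('h', '0110β')
print(run_tm(FIG_B, "q1", "011"))   # ('h', 'β011')
```

States $q_2$ and $q_3$ (and likewise $q_4$ and $q_5$) carry the single remembered letter, which is how a finite control moves an unbounded string without storing it.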

A third example of a TM $M$ is one that accepts strings in the language $L = \{a^n b^n c^n \mid n \geq 1\}$.
$M$ inserts an end-of-tape marker to the left of a string $w$ placed on its tape and uses a
computation denoted $C(x, y)$, in which it moves right across zero or more $x$'s followed by
zero or more "pseudo-blanks" (a symbol other than $a$, $b$, $c$, or $\beta$) to an instance of $y$, entering
a non-accepting halt state $f$ if some other pattern of letters is found. Starting in the first cell,
if $M$ discovers that the next letter is not $a$, it exits to state $f$. If it is $a$, it replaces $a$ by a
pseudo-blank. It then executes $C(a, b)$. $M$ then replaces $b$ by a pseudo-blank and executes
$C(b, c)$, after which it replaces $c$ by a pseudo-blank and executes $C(c, \beta)$. It then returns to
the beginning of the tape. If it arrives at the end-of-tape marker without encountering any
instances of $a$, $b$, or $c$, it terminates in the accepting halt state $h$. If not, then it moves right
over pseudo-blanks until it finds an $a$, entering state $f$ if it finds some other letter. It then
resumes the process executed on the first pass by invoking $C(a, b)$. This computation either
enters the non-accepting halt state $f$ or on each pass it replaces one instance each of $a$, $b$, and
$c$ with a pseudo-blank. Thus, $M$ accepts the language $L = \{a^n b^n c^n \mid n \geq 1\}$; that is, $L$
is decidable (recursive). Since $M$ makes one pass over the tape for each instance of $a$, it uses
time $O(n^2)$ on a string of length $n$. Later we give examples of languages that are recursively
enumerable but not recursive.
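
The marking passes described above can be sketched directly on a list that stands in for the tape. This is an illustrative Python sketch under stated assumptions: the pseudo-blank is written as `#`, the end-of-tape marker is left implicit (index 0 plays that role), and the names `scan` and `accepts_anbncn` are hypothetical.

```python
PSEUDO = "#"   # pseudo-blank; assumed to differ from a, b, c and the blank
BLANK = "β"


def scan(tape, pos, x, y):
    """C(x, y): cross zero or more x's, then zero or more pseudo-blanks.

    Returns the index of the y that follows, or None if another pattern is found.
    """
    while tape[pos] == x:
        pos += 1
    while tape[pos] == PSEUDO:
        pos += 1
    return pos if tape[pos] == y else None


def accepts_anbncn(w):
    """Decide membership in {a^n b^n c^n | n >= 1} by repeated marking passes."""
    tape = list(w) + [BLANK]
    pos = 0
    while True:
        while tape[pos] == PSEUDO:       # skip cells erased on earlier passes
            pos += 1
        if tape[pos] != "a":             # corresponds to entering state f
            return False
        for x, y in (("a", "b"), ("b", "c"), ("c", BLANK)):
            tape[pos] = PSEUDO           # erase the letter just matched
            pos = scan(tape, pos + 1, x, y)
            if pos is None:
                return False
        if all(s == PSEUDO for s in tape[:-1]):
            return True                  # nothing left: accepting halt state h
        pos = 0                          # return to the beginning of the tape


print(accepts_anbncn("aabbcc"), accepts_anbncn("aabbc"))  # True False
```

Each pass is a linear sweep and erases one $a$, one $b$, and one $c$, so the number of passes is at most the number of $a$'s, which is the source of the quadratic time bound.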

In  Section  3.8  we  reasoned  that  any  RAM  computation  can  be  simulated  by  a  Turing 
machine.  We  showed  that  any  program  written  for  the  RAM  can  be  executed  on  a  Turing 
machine  at  the  expense  of  an  increase  in  the  running  time  from  T  steps  on  a  RAM  with  S  bits 
of storage to a time $O(ST \log^2 S)$ on the Turing machine.

5.2  Extensions  to  the  Standard  Turing  Machine  Model 

In  this  section  we  examine  various  extensions  to  the  standard  Turing  machine  model  and 
establish  their  equivalence  to  the  standard  model.  These  extensions  include  the  multi-tape, 
nondeterministic,  and  oracle  Turing  machines. 

We  first  consider  the  double-ended  tape  Turing  machine.  Unlike  the  standard  TM  that 
has  a  tape  bounded  on  one  end,  this  is  a  TM  whose  single  tape  is  double-ended.  A  TM  of  this 
kind  can  be  simulated  by  a  two-track  one-tape  TM  by  reading  and  writing  data  on  the  top 
track  when  working  on  cells  to  the  right  of  the  midpoint  of  the  tape  and  reading  and  writing 
data  on  the  bottom  track  when  working  with  cells  to  its  left.  (See  Problem  5.7.) 

5.2.1  Multi-Tape  Turing  Machines 

A $k$-tape Turing machine has a control unit and $k$ single-ended tapes of the kind shown in
Fig. 5.1. Each tape has its own head and operates in the fashion indicated for the standard
model. The FSM control unit accepts inputs from all tapes simultaneously, makes a state
transition based on this data, and then supplies outputs to each tape in the form of either a
letter to be written under its head or a head movement command. We assume that the tape
alphabet of each tape is $\Gamma$. A three-tape TM is shown in Fig. 5.4. A $k$-tape TM $M_k$ can be
simulated by a one-tape TM $M_1$, as we now show.

THEOREM 5.2.1 For each $k$-tape Turing machine $M_k$ there is a one-tape Turing machine $M_1$
such that a terminating $T$-step computation by $M_k$ can be simulated in $O(T^2)$ steps by $M_1$.

Proof Let $\Gamma$ and $\Gamma'$ be the tape alphabets of $M_k$ and $M_1$, respectively. Let $|\Gamma'| = (2|\Gamma|)^k$
so that $\Gamma'$ has enough letters to allow the tape of $M_1$ to be subdivided into $k$ tracks, as
suggested in Fig. 5.5. Each cell of a track holds one of $2|\Gamma|$ letters, a number large enough to
allow each cell to contain either a member of $\Gamma$ or a marked member of $\Gamma$. The marked
members retain their original identity but also contain the information that they have been
marked. As suggested in Fig. 5.5 for a three-tape TM, $k$ heads can be simulated by one head
by marking the positions of the $k$ heads on the tracks of $M_1$.

Figure 5.4 A three-tape Turing machine.

Figure 5.5 A single tape of a TM with a large tape alphabet that simulates a three-tape TM
with a smaller tape alphabet.

$M_1$ simulates $M_k$ in two passes. First it visits marked cells to collect the letters under
the original tape heads, after which it makes a state transition akin to that made by $M_k$. In a
second pass it visits the marked cells either to change their entries or to move the simulated
tape heads. If the $k$-tape TM executes $T$ steps, it uses at most $T + 1$ tape cells. Thus each
pass requires $O(T)$ steps and the complete computation can be done in $O(T^2)$ steps. ■
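
The track construction in this proof can be illustrated with a short Python sketch. It is only an illustration of the encoding, not the full simulation: the function names `pack_tracks` and `read_heads` are hypothetical, and a marked letter is modeled as a `(symbol, True)` pair rather than as a distinct member of $\Gamma'$.

```python
BLANK = "β"


def pack_tracks(tapes, heads):
    """Merge k tapes into one tape whose cells are k-tuples of (symbol, marked).

    The r-th component of a cell is marked exactly when head r sits on that cell.
    """
    width = max(max(len(t) for t in tapes), max(heads) + 1)
    cells = []
    for i in range(width):
        cell = tuple(
            (t[i] if i < len(t) else BLANK, i == h)
            for t, h in zip(tapes, heads)
        )
        cells.append(cell)
    return cells


def read_heads(cells):
    """First pass of the simulation: one sweep collects the k letters under the marks."""
    found = [None] * len(cells[0])
    for cell in cells:
        for r, (symbol, marked) in enumerate(cell):
            if marked:
                found[r] = symbol
    return found


single_tape = pack_tracks(["0110", "ab", "x"], heads=[2, 0, 3])
print(read_heads(single_tape))  # ['1', 'a', 'β']
```

The sweep in `read_heads` touches every used cell once, which mirrors the $O(T)$ cost of each pass in the proof.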

Multi-tape  machines  in  which  the  tapes  are  double-ended  are  equivalent  to  multi-tape 
single-ended  Turing  machines,  as  the  reader  can  show. 

5.2.2  Nondeterministic  Turing  Machines 

The  nondeterministic  standard  Turing  machine  (NDTM)  is  introduced  in  Section  3.7.1. 
We  use  a  slightly  altered  definition  that  conforms  to  the  definition  of  the  standard  Turing 
machine  in  Definition  5.1.1. 

DEFINITION 5.2.1 A nondeterministic Turing machine (NDTM) is a seven-tuple $M =
(\Sigma, \Gamma, \beta, Q, \delta, s, h)$ where $\Sigma$ is the choice input alphabet, $\Gamma$ is the tape alphabet not
containing the blank symbol $\beta$, $Q$ is the finite set of states, $\delta : Q \times \Sigma \times (\Gamma \cup \{\beta\}) \mapsto
(Q \cup \{h\}) \times (\Gamma \cup \{\beta\} \cup \{L, R\}) \cup \{\bot\}$ is the next-state function, $s$ is the initial state,
and $h \notin Q$ is the accepting halt state. A TM cannot exit from $h$. If $M$ is in state $q$ with letter
$a$ under the tape head and $\delta(q, c, a) = (q', C)$, its control unit enters state $q'$ and writes $a'$ if
$C = a' \in \Gamma \cup \{\beta\}$, or it moves the head left (if possible) or right if $C$ is $L$ or $R$, respectively. If
$\delta(q, c, a) = \bot$, there is no successor to the current state with choice input $c$ and tape symbol $a$.

Figure 5.6 The construction used to reduce the fan-out of a nondeterministic state.

An NDTM $M$ reads one character of its choice input string $c \in \Sigma^*$ on each step. An NDTM
$M$ accepts string $w$ if there is some choice string $c$ such that the last state entered by $M$ is $h$ when
$M$ is started in state $s$ with $w$ placed left-adjusted on its otherwise blank tape, and the tape head
at the leftmost tape cell. An NDTM $M$ accepts the language $L(M) \subseteq \Gamma^*$ consisting of those
strings $w$ that it accepts. Thus, if $w \notin L(M)$, there is no choice input for which $M$ accepts $w$.

If an NDTM has more than two nondeterministic choices for a particular state and letter
under the tape head, we can design another NDTM that has at most two choices. As suggested
in Fig. 5.6, for each state $q$ that has $k$ possible next states $q_1, \ldots, q_k$ for some input letter, we
can add $k - 2$ intermediate states, each with two outgoing edges such that a) in each state the
tape head doesn't move and no change is made in the letter under the head, but b) each state
has the same $k$ possible successor states. It follows that the new machine computes the same
function or accepts the same language as the original machine. Consequently, from this point
on we assume that there are either one or two next states from each state of an NDTM for
each tape symbol.
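
One way to realize this fan-out reduction is a chain of binary choices. The Python sketch below is an assumption-laden illustration: the function names and the naming scheme for intermediate states (`q_mid1`, `q_mid2`, ...) are hypothetical, and only the successor structure is modeled, not the tape.

```python
def binary_chain(q, successors):
    """Replace a k-way choice at state q by binary choices.

    Returns a dict mapping q and the k - 2 new intermediate states to their
    successors. Intermediate states neither move the head nor change the tape.
    """
    k = len(successors)
    if k <= 2:
        return {q: list(successors)}
    edges = {}
    current = q
    for i in range(k - 2):
        nxt = f"{q}_mid{i + 1}"          # hypothetical intermediate-state name
        edges[current] = [successors[i], nxt]
        current = nxt
    edges[current] = [successors[-2], successors[-1]]
    return edges


def reachable_originals(edges, q):
    """Collect the original successors reachable from q through intermediate states."""
    out, stack = [], [q]
    while stack:
        state = stack.pop()
        for t in edges[state]:
            if t in edges and t != q:
                stack.append(t)          # intermediate state: keep expanding
            else:
                out.append(t)
    return sorted(out)


edges = binary_chain("q", ["q1", "q2", "q3", "q4"])
print(edges)
print(reachable_originals(edges, "q"))  # ['q1', 'q2', 'q3', 'q4']
```

For $k = 4$ the chain adds two intermediate states, and every state in it has exactly two outgoing edges, matching the count $k - 2$ in the text.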

We  now  show  that  the  range  of  computations  that  can  be  performed  by  deterministic  and 
nondeterministic  Turing  machines  is  the  same.  However,  this  does  not  mean  that  with  the 
identical  resource  bounds  they  compute  the  same  set  of  functions. 

THEOREM  5.2.2  Any  language  accepted  by  a  nondeterministic  standard  TM  can  be  accepted  by  a 
standard  deterministic  one. 

Proof The proof is by simulation. We simulate all possible computations of a nondeterministic
standard TM $M_{ND}$ on an input string $w$ by a deterministic three-tape TM $M_D$
and halt if we find a sequence of moves by $M_{ND}$ that leads to an accepting halt state. Later
this machine can be simulated by a one-tape TM. The three tapes of $M_D$ are an input
tape, a work tape, and an enumeration tape. (See Fig. 5.7.) The input tape holds the input
and is never modified. The work tape is used to simulate $M_{ND}$. The enumeration
tape contains choice sequences used by $M_D$ to decide which move to make when simulating
$M_{ND}$. These sequences are generated in lexicographical order, that is, in the order
0, 1, 00, 01, 10, 11, 000, 001, .... It is straightforward to design a deterministic TM that
generates these sequences. (See Problem 5.2.)

Figure 5.7 A three-tape deterministic Turing machine that simulates a nondeterministic Turing
machine.

Breadth-first search is used. Since a string $w$ is accepted by a nondeterministic TM if
there is some choice input on which it is accepted, a deterministic TM $M_D$ that accepts the
input $w$ accepted by $M_{ND}$ can be constructed by erasing the work tape, copying the input
sequence $w$ to the work tape, placing the next choice input sequence in lexicographical order
on the enumeration tape (initially this is the sequence 0), and then simulating $M_{ND}$ on
the work tape while reading one choice input from the enumeration tape on each step. If
$M_D$ runs out of choice inputs before reaching the halt state, the above procedure is restarted
with the next choice input sequence. This method deterministically accepts the input string
$w$ if and only if there is some choice input to $M_{ND}$ on which it is accepted. ■
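
The enumeration strategy of this proof can be sketched in Python as follows. Several things here are assumptions of the illustration rather than part of the proof: the function names, the example machine `DELTA`, and in particular the cap `max_len`, which is added only so that the sketch terminates on rejected inputs (the machine $M_D$ of the proof has no such cap and may run forever).

```python
from itertools import count, islice, product

BLANK = "β"


def choice_strings():
    """Yield 0, 1, 00, 01, 10, 11, 000, ... with shorter strings first."""
    for n in count(1):
        for bits in product("01", repeat=n):
            yield "".join(bits)


def run_with_choices(delta, start, w, choices, halt="h"):
    """Simulate the NDTM on a fresh copy of w, one choice symbol per step."""
    tape = list(w) or [BLANK]
    head, state = 0, start
    for c in choices:
        move = delta.get((state, c, tape[head]))  # a missing entry plays the role of ⊥
        if move is None:
            return False
        state, action = move
        if action == "R":
            head += 1
            if head == len(tape):
                tape.append(BLANK)
        elif action == "L":
            if head == 0:
                return False
            head -= 1
        else:
            tape[head] = action
        if state == halt:
            return True
    return False                                  # ran out of choice inputs


def accepts(delta, start, w, max_len=12):
    """Try choice strings in order; max_len is an artificial cutoff for this sketch."""
    for c in choice_strings():
        if len(c) > max_len:
            return False      # acceptance not established within the budget
        if run_with_choices(delta, start, w, c):
            return True


# Hypothetical NDTM: guess a position holding a 1 and accept there.
DELTA = {
    ("s", "0", "0"): ("s", "R"), ("s", "1", "0"): ("s", "R"),
    ("s", "0", "1"): ("s", "R"), ("s", "1", "1"): ("h", "1"),
}

print(list(islice(choice_strings(), 6)))  # ['0', '1', '00', '01', '10', '11']
print(accepts(DELTA, "s", "001"))         # True
```

Restarting from a fresh copy of the input for every choice string is wasteful but keeps the simulator's own state bounded, which is the point of using a separate enumeration tape.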

Adding  more  than  one  tape  to  a  nondeterministic  Turing  machine  does  not  increase  its 
computing  power.  To  see  this,  it  suffices  to  simulate  a  multi-tape  nondeterministic  Turing 
machine  with  a  single-tape  one,  using  a  construction  parallel  to  that  of  Theorem  5.2.1,  and 
then  invoke  the  above  result.  Applying  these  observations  to  language  acceptance  yields  the 
following  corollary. 

COROLLARY 5.2.1 Any language accepted by a nondeterministic (multi-tape) Turing machine can
be accepted by a deterministic standard Turing machine.

We emphasize that this result does not mean that with identical resource bounds the
deterministic and nondeterministic Turing machines compute the same set of functions.

5.2.3  Oracle  Turing  Machines 

The oracle Turing machine (OTM) is a multi-tape TM or NDTM with a special oracle
tape and an associated oracle function $h : \mathcal{B}^* \mapsto \mathcal{B}^*$, which need not be computable. (See
Fig. 5.8.) After writing a string $z$ on its oracle tape, the OTM signals to the oracle to replace
$z$ with the value $h(z)$ of the oracle function. During a computation the OTM may consult
the oracle as many times as it wishes. Time on an OTM is the number of steps taken, where
one consultation of the oracle is counted as one step. Space is the number of cells used on
the work tapes of an OTM not including the oracle tape. The OTM can be used to
classify problems. (See Problem 8.15.)

Figure 5.8 The oracle Turing machine has an "oracle tape" on which it writes a string (a problem
instance), after which an "oracle" returns an answer in one step.

5.2.4  Representing  Restricted  Models  of  Computation 

Now  that  we  have  introduced  a  variety  of  Turing  machine  models,  we  ask  how  the  finite-state 
machine  and  pushdown  automaton  fit  into  the  picture. 

The  finite-state  machine  can  be  viewed  as  a  Turing  machine  with  two  tapes,  the  first  a 
read-only  input  tape  and  the  second  a  write-only  output  tape.  This  TM  reads  consecutive 
symbols  on  its  input  tape,  moving  right  after  reading  each  symbol,  and  writes  outputs  on  its 
output  tape,  moving  right  after  writing  each  symbol.  If  this  TM  enters  an  accepting  halt  state, 
the  input  sequence  read  from  the  tape  is  accepted. 

The  pushdown  automaton  can  be  viewed  as  a  Turing  machine  with  two  tapes,  a  read-only 
input  tape  and  a  pushdown  tape.  The  pushdown  tape  is  a  standard  tape  that  pushes  a  new 
symbol  by  moving  its  head  right  one  cell  and  writing  the  new  symbol  into  this  previously 
blank  cell.  It  pops  the  symbol  at  the  top  of  the  stack  by  copying  the  symbol,  after  which  it 
replaces  it  with  the  blank  symbol  and  moves  its  head  left  one  cell. 

The  Turing  machine  can  be  simulated  by  two  pushdown  tapes.  The  movement  of  the  head 
in  one  direction  can  be  simulated  by  popping  the  top  item  of  one  stack  and  pushing  it  onto 
the  other  stack.  To  simulate  the  movement  of  the  head  in  the  opposite  direction,  interchange 
the  names  of  the  two  stacks. 
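
The two-stack view of a tape can be sketched in a few lines of Python. The class name `TwoStackTape` and its method names are assumptions of this illustration. One stack holds the cells to the left of the head and the other holds the cell under the head together with everything to its right.

```python
BLANK = "β"


class TwoStackTape:
    """A single-ended tape represented by two pushdown stacks."""

    def __init__(self, contents):
        self.left = []                                     # top = cell just left of the head
        self.right = list(reversed(contents)) or [BLANK]   # top = cell under the head

    def read(self):
        return self.right[-1]

    def write(self, symbol):
        self.right[-1] = symbol

    def move_right(self):
        self.left.append(self.right.pop())   # pop one stack, push onto the other
        if not self.right:
            self.right.append(BLANK)         # blank cells appear on demand

    def move_left(self):
        if not self.left:
            raise RuntimeError("moved left from the first cell")
        self.right.append(self.left.pop())   # same operation with the stacks swapped

    def contents(self):
        return "".join(self.left) + "".join(reversed(self.right))


tape = TwoStackTape("abc")
tape.move_right()
tape.write("X")
print(tape.contents())  # aXc
```

Both head movements are the same pop-then-push operation with the roles of the stacks interchanged, as the paragraph above observes.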

The  nondeterministic  equivalents  of  the  finite-state  machine  and  pushdown  automaton 
are  obtained  by  making  their  Turing  machine  control  units  nondeterministic. 

5.3 Configuration Graphs

We  now  introduce  configuration  graphs,  graphs  that  capture  the  state  of  Turing  machines 
with  potentially  unlimited  storage  capacity.  We  begin  by  describing  configuration  graphs  for 
one-tape  Turing  machines. 

DEFINITION 5.3.1 The configuration of a standard Turing machine $M$ at any point in time
is $[x_1 x_2 \cdots \mathbf{p} x_j \cdots x_n]$, where $p$ is the state of the control unit, the tape head is over the $j$th tape
cell, and $x = (x_1, x_2, \ldots, x_n)$ is the string that contains all the non-blank symbols on the tape as
well as the symbol under the head. Here the state $\mathbf{p}$ is shown in boldface to the left of the symbol $x_j$ to
indicate that the tape head is over the $j$th cell. $x_n$ and some of the symbols to its left may be blanks.

To illustrate such configurations, consider a TM $M$ that is in state $p$ reading the third
symbol on its tape, which contains $xyz$. This information is captured by the configuration
$[xy\mathbf{p}z]$. If $M$ changes to state $q$ and moves its head right, then its new configuration is
$[xyz\mathbf{q}\beta]$. In this case we add a blank $\beta$ to the right of the string $xyz$ to ensure that the head
resides over the string.
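
The example above can be reproduced with a small helper. This is a Python sketch in which the function name `configuration` and the zero-based `head` index are assumptions. State names are taken to be single characters so that the rendered string is unambiguous, since plain text cannot show boldface.

```python
def configuration(state, head, tape, blank="β"):
    """Render a configuration in the style [x1 x2 ... p xj ... xn].

    `head` is a zero-based index, so head = 2 means the third tape cell.
    """
    cells = list(tape)
    while len(cells) <= head:            # pad so the head resides over the string
        cells.append(blank)
    while len(cells) > head + 1 and cells[-1] == blank:
        cells.pop()                      # drop blanks to the right of the head
    return "[" + "".join(cells[:head]) + state + "".join(cells[head:]) + "]"


print(configuration("p", 2, "xyz"))  # [xypz]
print(configuration("q", 3, "xyz"))  # [xyzqβ]
```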

Because multi-tape TMs are important in classifying problems by their use of temporary
work space, a definition for the configuration of a multi-tape TM is desirable. We now introduce
a notation for this purpose that is somewhat more cumbersome than that used for the standard
TM. This notation uses an explicit binary number for the position of each tape head.

DEFINITION 5.3.2 The configuration of a $k$-tape Turing machine $M$ is $(p, h_1, h_2, \ldots, h_k,
x_1, x_2, \ldots, x_k)$, where $h_r$ is the position of the head in binary on the $r$th tape, $p$ is the state of
the control unit, and $x_r$ is the string on the $r$th tape that includes all the non-blank symbols as well
as the symbol under the head.

We  now  define  configuration  graphs  for  deterministic  TMs  and  NDTMs.  Because  we  will 
apply  configuration  graphs  to  machines  that  halt  on  all  inputs,  we  view  them  as  acyclic. 

DEFINITION 5.3.3 A configuration graph $G(M_{ND}, w)$ associated with the NDTM $M_{ND}$ is a
directed graph whose vertices are configurations of $M_{ND}$. (See Fig. 5.9.) There is a directed edge
between two vertices if for some choice input vector $c$ $M_{ND}$ can move from the first configuration to
the second in one step. There is one configuration corresponding to the initial state of the machine
and one corresponding to the final state. (We assume without loss of generality that, after accepting
an input string, $M_{ND}$ enters a cleanup phase during which it places a fixed string on each tape.)

Figure 5.9 The configuration graph $G(M_{ND}, w)$ of a nondeterministic Turing machine $M_{ND}$
on input $w$ has one vertex for each configuration of $M_{ND}$. The graph is acyclic. Heavy edges
identify the nondeterministic choices associated with each configuration.

Configuration graphs are used in the next section to associate a phrase-structure language
with a Turing machine. They are also used in many places in Chapter 8, especially in
Section 8.5.3, where they are used to establish an important relationship between deterministic
and nondeterministic space classes.

5.4  Phrase-Structure  Languages  and  Turing  Machines 

We  now  demonstrate  that  the  phrase-structure  languages  and  the  languages  accepted  by  Turing 
machines  are  the  same.  We  begin  by  showing  that  every  recursively  enumerable  language 
is  a  phrase-structure  language.  For  this  purpose  we  use  configurations  of  one-tape  Turing 
machines.  Then,  for  each  phrase-structure  language  L  we  describe  the  construction  of  a  TM 
accepting  L.  We  conclude  that  the  languages  accepted  by  TMs  and  described  by  phrase- 
structure  grammars  are  the  same. 

With these conventions as background, if a standard TM halts in its accepting halt state,
we can require that it halt with $\beta 1 \beta$ on its tape when it accepts the input string $w$. Thus,
the TM configuration when a TM halts and accepts its input string is $[\mathbf{h}\beta 1 \beta]$. Its starting
configuration is $[\mathbf{s}\beta w_1 w_2 \cdots w_n \beta]$, where $w = w_1 w_2 \cdots w_n$.

THEOREM  5.4. 1  Every  recursively  enumerable  language  is  a  phrase-structure  language. 

Proof Let $M = (\Gamma, \beta, Q, \delta, s, h)$ be a deterministic TM and let $L(M)$ be the recursively
enumerable language over the alphabet $\Gamma$ that it accepts. The goal is to show the existence of
a phrase-structure grammar $G = (\mathcal{N}, \mathcal{T}, \mathcal{R}, S)$ that can generate each string $w$ of $L = L(M)$, and no
others. Since the TM accepting $L$ halts with $\beta 1 \beta$ on its tape when started with $w \in L$, we
design a grammar $G$ that produces the configurations of $M$ in reverse order. Starting with
the final configuration $[\mathbf{h}\beta 1 \beta]$, $G$ produces the starting configuration $[\mathbf{s}\beta w_1 w_2 \cdots w_n \beta]$,
where $w = w_1 w_2 \cdots w_n$, after which it strips off the characters $[\mathbf{s}\beta$ at the beginning and
$\beta]$ at the end. The grammar $G$ defined below serves this purpose, as we show.


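Let $\mathcal{N} = Q \cup \{S, \beta, [, ]\}$ and $\mathcal{T} = \Gamma$. The rules $\mathcal{R}$ of $G$ are defined as follows:

(a)  $S \rightarrow [\mathbf{h}\beta 1 \beta]$
(b)  $\beta] \rightarrow \beta\beta]$
(c)  $[\mathbf{s}\beta \rightarrow \epsilon$
(d)  $\beta\beta] \rightarrow \beta]$
(e)  $\beta] \rightarrow \epsilon$
(f)  $x\mathbf{q} \rightarrow \mathbf{p}x$  for all $p \in Q$ and $x \in (\Gamma \cup \{\beta\})$ such that $\delta(p, x) = (q, R)$
(g)  $\mathbf{q}zx \rightarrow z\mathbf{p}x$  for all $p \in Q$ and $x, z \in (\Gamma \cup \{\beta\})$ such that $\delta(p, x) = (q, L)$
(h)  $\mathbf{q}y \rightarrow \mathbf{p}x$  for all $p \in Q$ and $x \in (\Gamma \cup \{\beta\})$ such that $\delta(p, x) = (q, y)$, $y \in (\Gamma \cup \{\beta\})$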

These rules are designed to start with the transition $S \rightarrow [\mathbf{h}\beta 1 \beta]$ (Rule (a)) and then
rewrite $[\mathbf{h}\beta 1 \beta]$ using other rules until the configuration $[\mathbf{s}\beta w_1 w_2 \cdots w_n \beta]$ is reached. At
this point Rule (c) is invoked to strip $[\mathbf{s}\beta$ from the beginning of the string, and Rule (e) strips
$\beta]$ from the end, thereby producing the string $w_1 w_2 \cdots w_n$ that was written initially on
$M$'s tape.

Rule (b) is used to add blank space at the right-hand end of the tape. Rules (f)-(h)
mimic the transitions of $M$ in reverse order. Rule (f) says that if $M$ in state $p$ reading $x$
moves to state $q$ and moves its head right, then $M$'s configuration contained the substring
$\mathbf{p}x$ before the move and $x\mathbf{q}$ after it. Thus, we map $x\mathbf{q}$ into $\mathbf{p}x$ with the rule $x\mathbf{q} \rightarrow \mathbf{p}x$.
Similar reasoning is applied to Rule (g). If the transition $\delta(p, x) = (q, y)$, $y \in \Gamma \cup \{\beta\}$,
is executed, $M$'s configuration contained the substring $\mathbf{p}x$ before the step and $\mathbf{q}y$ after it
because the head does not move.

Clearly,  every  computation  by  a  TM  M  can  be  described  by  a  sequence  of  configurations 
and  the  transitions  between  these  configurations  can  be  described  by  this  grammar  G.  Thus, 
the  strings  accepted  by  M  can  be  generated  by  G.  Conversely,  if  we  are  given  a  derivation 
in  G,  it  produces  a  series  of  configurations  characterizing  computations  by  the  TM  M  in 
reverse  order.  Thus,  the  strings  generated  by  G  are  the  strings  accepted  by  M .  ■ 
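Rules (f)-(h) can be produced mechanically from the next-state function of M. The following Python sketch illustrates this under assumptions that are not part of the text: states and tape symbols are single characters taken from disjoint sets, the letters L and R are reserved for head moves, and the names reverse_rules, delta and tape_symbols are hypothetical.

```python
def reverse_rules(delta, tape_symbols):
    """Return grammar rules (lhs, rhs) that undo one step of the TM.

    delta maps (p, x) to (q, z): in state p reading x the machine enters
    state q and either moves ("L" or "R") or writes the tape symbol z.
    """
    rules = []
    for (p, x), (q, z) in delta.items():
        if z == "R":
            # Rule (f): the configuration went from ...px... to ...xq...
            rules.append((x + q, p + x))
        elif z == "L":
            # Rule (g): with y to the left of the head, ...ypx... became ...qyx...
            for y in tape_symbols:
                rules.append((q + y + x, y + p + x))
        else:
            # Rule (h): the head stayed put and x was overwritten by z.
            rules.append((q + z, p + x))
    return rules


BLANK = "β"
delta = {
    ("s", "0"): ("s", "R"),
    ("s", "1"): ("p", "L"),
    ("p", "0"): ("h", "1"),
}
for lhs, rhs in reverse_rules(delta, "01" + BLANK):
    print(lhs, "->", rhs)
```

Each printed rule rewrites the substring that appears after a step into the substring that appeared before it, which is why a derivation in G traces a computation of M backwards.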

By showing that every phrase-structure language can be accepted by a Turing machine, we
will have demonstrated the equivalence between the phrase-structure and recursively
enumerable languages.

THEOREM  5.4.2  Every  phrase-structure  language  is  recursively  enumerable. 

Proof  Given  a  phrase-structure  grammar  G,  we  construct  a  nondeterministic  two-tape  TM 
M  with  the  property  that  L(G )  =  L(M).  Because  every  language  accepted  by  a  multi-tape 
TM  is  accepted  by  a  one-tape  TM  and  vice  versa,  we  have  the  desired  conclusion. 

To  decide  whether  or  not  to  accept  an  input  string  placed  on  its  first  (input)  tape,  M 
nondeterministically  generates  a  terminal  string  on  its  second  (work)  tape  using  the  rules  of 
G.  To  do  so,  it  puts  G’s  start  symbol  on  its  work  tape  and  then  nondeterministically  expands 
it  into  a  terminal  string  using  the  rules  of  G.  After  producing  a  terminal  string,  M  compares 
the  input  string  with  the  string  on  its  work  tape.  If  they  agree  in  every  position,  M  accepts 
the  input  string.  If  not,  M  enters  an  infinite  loop.  To  write  the  derived  strings  on  its  work 
tape,  M  must  either  replace,  delete,  or  insert  characters  in  the  string  on  its  tape,  tasks  well 
suited  to  Turing  machines. 

Since it is possible for M to generate every string in L(G) on its work tape, it can accept
every string in L(G). On the other hand, every string accepted by M is a string that it can
generate using the rules of G. Thus, every string accepted by M is in L(G). It follows that
L(M) = L(G). ■

This last result gives meaning to the phrase "recursively enumerable"; the languages
accepted by Turing machines (the recursively enumerable languages) are languages whose strings
can be enumerated by a Turing machine (a recursive device). Since an NDTM can be
simulated by a DTM, all strings accepted by a TM can be generated deterministically in sequence.
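The deterministic generation of strings can be illustrated with a breadth-first search over sentential forms. The Python sketch below is only an illustration under stated assumptions: grammar symbols are single characters, the function name derive_terminal_strings is hypothetical, and the bound max_len on the length of sentential forms is added so that the search terminates. With that bound the search can miss strings of a grammar whose derivations need longer intermediate forms.

```python
from collections import deque


def derive_terminal_strings(rules, start, terminals, max_len):
    """Breadth-first search over sentential forms of length <= max_len."""
    seen = {start}
    queue = deque([start])
    found = []
    while queue:
        form = queue.popleft()
        if all(symbol in terminals for symbol in form):
            found.append(form)  # a terminal string: nothing left to rewrite
            continue
        for lhs, rhs in rules:
            i = form.find(lhs)
            while i != -1:
                # Apply the rule at every position where its left side occurs.
                new_form = form[:i] + rhs + form[i + len(lhs):]
                if len(new_form) <= max_len and new_form not in seen:
                    seen.add(new_form)
                    queue.append(new_form)
                i = form.find(lhs, i + 1)
    return found


# A standard phrase-structure grammar for strings a^n b^n c^n with n >= 1.
rules = [
    ("S", "aSBC"), ("S", "aBC"), ("CB", "BC"),
    ("aB", "ab"), ("bB", "bb"), ("bC", "bc"), ("cC", "cc"),
]
print(derive_terminal_strings(rules, "S", set("abc"), 6))
```

Trying every rule at every position replaces the nondeterministic choices of M by an exhaustive search, and shorter derivations are explored before longer ones.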

5.5  Universal  Turing  Machines 

A  universal  Turing  machine  is  a  Turing  machine  that  can  simulate  the  behavior  of  an  arbitrary 
Turing  machine,  even  the  universal  Turing  machine  itself.  To  give  an  explicit  construction  for 
such a machine, we show how to encode Turing machines as strings.

Without loss of generality we consider only deterministic Turing machines
$M = (\Gamma, \beta, Q, \delta, s, h)$ that have a binary tape alphabet $\Gamma = \mathcal{B} = \{0, 1\}$. When M is in state p and the
value under the head is a, the next-state function
$\delta : Q \times (\Gamma \cup \{\beta\}) \mapsto (Q \cup \{h\}) \times (\Gamma \cup \{\beta\} \cup \{L, R\})$
takes M to state q and provides output z, where $\delta(p, a) = (q, z)$ and
$z \in \Gamma \cup \{\beta\} \cup \{L, R\}$.

We now specify a convention for numbering states that simplifies the description of the
next-state function $\delta$ of M.

DEFINITION 5.5.1 The canonical encoding of a Turing machine M, p(M), is a string over the
10-letter alphabet $A = \{<, >, [, ], \#, 0, 1, \beta, R, L\}$ formed as follows:

(a) Let $Q = \{q_1, q_2, \ldots, q_k\}$ where $s = q_1$. Represent state $q_i$ in unary notation by the string
$1^i$. The halt state h is represented by the empty string.

(b) Let $(q, z)$ be the value of the next-state function when M is in state p reading a under
its tape head; that is, $\delta(p, a) = (q, z)$. Represent $(q, z)$ by the string $<z\#q>$ in which q is
represented in unary and $z \in \{0, 1, \beta, L, R\}$. If $q = h$, the value of the next-state function is
$<z\#>$.

(c) For $p \in Q$, the three values $<z'\#q'>$, $<z''\#q''>$, and $<z'''\#q'''>$ of $\delta(p, 0)$,
$\delta(p, 1)$, and $\delta(p, \beta)$ are assembled as a triple $[<z'\#q'><z''\#q''><z'''\#q'''>]$. The
complete description of the next-state function $\delta$ is given as a sequence of such triples, one for each
state $p \in Q$.

To illustrate this definition, consider the two TMs whose next-state functions are shown in
Fig. 5.3. The first moves across the non-blank initial string on its tape and halts over the first
blank symbol. The second moves the input string right one position and inserts a blank to its
left. The canonical encoding of the first TM is $[<R\#1><R\#1><\beta\#>]$ whereas that
of the second is

$[<\beta\#11><\beta\#111><\beta\#>]$

$[<R\#1111><R\#1111><R\#1111>]$

$[<R\#11111><R\#11111><R\#11111>]$

$[<0\#11><0\#111><0\#>]$

$[<1\#11><1\#111><1\#>]$

It follows that the canonical encodings of TMs are a subset of the strings defined by the
regular expression $([(<\{0, 1, \beta, L, R\}\#1^*>)^3])^*$, which a TM can analyze to ensure that for
each state and tape letter there is a valid action.
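The encoding rules and the shape test can be tried out in a few lines of Python. This is a sketch under assumptions of our own: states are the integers 1, 2, ..., with 0 standing for the halt state h, the next-state function is a dictionary, and the names encode, has_canonical_shape and scan_right are hypothetical. The machine scan_right is the one described above that moves right across its input and halts on the first blank.

```python
import re

BLANK = "β"


def encode(delta, num_states):
    """Build the canonical encoding; delta maps (state, symbol) to (next_state, z)."""
    triples = []
    for p in range(1, num_states + 1):
        entries = ""
        for a in "01" + BLANK:
            q, z = delta[(p, a)]
            # The next state is written in unary; the halt state 0 gives no 1s.
            entries += "<" + z + "#" + "1" * q + ">"
        triples.append("[" + entries + "]")
    return "".join(triples)


# One triple per state and exactly three entries per triple.
SHAPE = re.compile(r"(\[(<[01βLR]#1*>){3}\])*")


def has_canonical_shape(s):
    return SHAPE.fullmatch(s) is not None


scan_right = {
    (1, "0"): (1, "R"),
    (1, "1"): (1, "R"),
    (1, BLANK): (0, BLANK),
}
code = encode(scan_right, 1)
print(code)
print(has_canonical_shape(code), has_canonical_shape("[<R#1><R#1>]"))
```

The shape test only checks membership in the regular expression. A full validity check would also confirm that every next state named in an entry has a triple of its own.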

A  universal  Turing  machine  (UTM)  U  is  a  Turing  machine  that  is  capable  of  simulating 
an  arbitrary  Turing  machine  on  an  arbitrary  input  word  w.  The  construction  of  a  UTM  based 
on  the  simulation  of  the  random-access  machine  is  described  in  Section  3.8.  Here  we  describe 
a  direct  construction  of  a  UTM. 

Let the UTM U have a 20-letter alphabet $\hat{A}$ containing the 10 symbols in A plus another
10 symbols that are marked copies of the symbols in A, written here with a hat. (The marked copies are used to
simulate multiple tracks on a one-track TM.) That is, we define $\hat{A}$ as follows:

$\hat{A} = \{<, >, [, ], \#, 0, 1, \beta, R, L\} \cup \{\hat{<}, \hat{>}, \hat{[}, \hat{]}, \hat{\#}, \hat{0}, \hat{1}, \hat{\beta}, \hat{R}, \hat{L}\}$

To simulate the TM M on the input string w, we place M's canonical encoding, p(M),
on the tape of the UTM U preceded by $\beta$ and followed by w, as suggested in Fig. 5.10. The
first letter of w follows the rightmost bracket, ], and is marked by replacing it with its marked
equivalent, $\hat{w}_1$. The current state q of M is identified by replacing the left bracket, [, in q's
triple by its marked equivalent, $\hat{[}$. U simulates M by reading the marked input symbol a,
the one that resides under M's simulated head, and advancing its own head to the entry to
the right of $\hat{[}$ that corresponds to a. (Before it moves its head, it replaces $\hat{[}$ with [.) That is, it
advances its head to the first, second, or third entry of the triple associated with the current state depending
on whether a is 0, 1, or $\beta$. It then changes < to $\hat{<}$, moves to the symbol following $\hat{<}$ and takes
the required action on the simulated tape. If the action requires writing a symbol, it replaces a
with a new marked symbol. If it requires moving M's head, the marking on a is removed and
the appropriate adjacent symbol is marked. U returns to $\hat{<}$ and removes the mark.

Figure 5.10 The initial configuration of the tape of a universal TM that is prepared to simulate
the TM M on input w. The left end-of-tape marker is the blank symbol $\beta$.

The UTM U moves to the next state as follows. It moves its head three places to the
right of < after restoring it from $\hat{<}$, at which point it is to the right of #, over the first digit
representing the next state. If the symbol in this position is >, the next state is h, the halting
state, and the UTM halts. If the symbol is 1, U replaces it with $\hat{1}$ and then moves its head
left to the leftmost instance of [ (the leftmost tape cell contains $\beta$, an end-of-tape marker). It
marks [ and returns to $\hat{1}$. It replaces $\hat{1}$ with 1 and moves its head right one place. If U finds the
symbol 1, it marks it, moves left to $\hat{[}$, restores it to [ and then moves right to the next instance
of [ and marks it. It then moves right to $\hat{1}$ and repeats this operation. However, if the UTM
finds the symbol >, it has finished updating the current state so it moves right to the marked
tape symbol, at which point it reads the symbol under M's head and starts another transition
cycle. The details of this construction are left to the reader. (See Problem 5.15.)
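The effect of the UTM, though not its tape-level mechanics, can be imitated by a short program that takes a canonical encoding and an input word and carries out the encoded transitions. The Python sketch below rests on assumptions that go beyond the text: the head starts on the first letter of the input, the tape is stored in a dictionary, a step budget max_steps is imposed so that the call always returns, and the names decode and simulate are hypothetical.

```python
import re

BLANK = "β"


def decode(encoding):
    """Parse a canonical encoding into a table (state, symbol) -> (next_state, z)."""
    delta = {}
    for p, triple in enumerate(re.findall(r"\[(.*?)\]", encoding), start=1):
        entries = re.findall(r"<([01βLR])#(1*)>", triple)
        for a, (z, ones) in zip("01" + BLANK, entries):
            delta[(p, a)] = (len(ones), z)  # zero 1s encodes the halt state
    return delta


def simulate(encoding, word, max_steps=10_000):
    """Run the encoded machine; return (tape, head), or None if the budget runs out."""
    delta = decode(encoding)
    tape = dict(enumerate(word))
    state, head, steps = 1, 0, 0
    while state != 0:
        if steps == max_steps:
            return None
        state, z = delta[(state, tape.get(head, BLANK))]
        if z == "R":
            head += 1
        elif z == "L":
            head -= 1
        else:
            tape[head] = z  # z is a tape symbol to be written
        steps += 1
    lo, hi = min(tape, default=0), max(tape, default=0)
    return "".join(tape.get(i, BLANK) for i in range(lo, hi + 1)), head


print(simulate("[<R#1><R#1><β#>]", "101"))
```

A true universal machine has no step budget and simply fails to halt when the simulated machine does not halt. The budget is here only to keep the example safe to run.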

5.6  Encodings  of  Strings  and  Turing  Machines 

Given an alphabet A with an ordering of its letters, strings over this alphabet have an order
known as the standard lexicographical order, which we now define. In this order, strings of
length $n - 1$ precede strings of length n. Thus, if $A = \{0, 1, 2\}$, $201 < 0001$. Among the
strings of length n, if a and b are in A and $a < b$, then all strings beginning with a precede
those beginning with b. For example, if $0 < 1 < 2$ in $A = \{0, 1, 2\}$, then $022 < 200$. If two
strings of length n have the same prefix u, the ordering between them is determined by the
order of the next letter. For example, for the alphabet A and the ordering given on its letters,
$201021 < 201200$.

A simple algorithm produces the strings over an alphabet in lexicographical order. Strings
of length 1 are produced by enumerating the letters from the alphabet in increasing order.
Strings of length n are enumerated by choosing the first letter from the alphabet in increasing
order. The remaining $n - 1$ letters are generated in lexicographical order by applying this
algorithm recursively on strings of length $n - 1$.
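The recursive algorithm translates directly into a pair of generators. In this Python sketch the function names are hypothetical and the alphabet is passed as a string whose characters are already in increasing order. Following the text, enumeration starts with the strings of length 1.

```python
from itertools import count, islice


def strings_of_length(alphabet, n):
    """Yield the strings of length n in lexicographical order."""
    if n == 0:
        yield ""
        return
    for first in alphabet:  # first letter in increasing order
        for rest in strings_of_length(alphabet, n - 1):  # recurse on n - 1 letters
            yield first + rest


def standard_order(alphabet):
    """Yield all non-empty strings: shorter strings first, then by letter order."""
    for n in count(1):
        yield from strings_of_length(alphabet, n)


print(list(islice(standard_order("012"), 5)))
```

Because the generator is infinite, a caller takes a prefix of it with islice or stops when a string of interest appears.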

To prepare for later results, we observe that it is straightforward to test an arbitrary string
over the alphabet A given in Definition 5.5.1 to determine if it is a canonical description p(M)
of a Turing machine M. Each must be contained in $([(<\{0, 1, \beta, L, R\}\#1^*>)^3])^*$ and have
a transition for each state and tape letter. If a putative encoding is not canonical, we associate
with it the two-state null TM $T_{null}$ with next-state function satisfying $\delta(s, a) = (h, a)$ for all
tape letters a. This encoding associates a Turing machine with each string over the alphabet A.

We now show how to identify the jth Turing machine, $M_j$. Given an order to the
symbols in A, strings over this alphabet are generated in lexicographical order. We define the
null TM to be the zeroth TM. Each string over A that is not a canonical encoding is associated
with this machine. The first TM is the one described by the lexicographically first string over
A that is a canonical encoding. The second TM is described by the second canonical encoding,
etc. Not only does a TM determine which string is a canonical encoding, but when combined
with an algorithm to generate strings in lexicographical order, this procedure also assigns a
Turing machine to each string and allows the jth Turing machine to be found.

Observe that there is no loss in generality in assuming that the encodings of Turing
machines are binary strings. We need only create a mapping from the letters in the alphabet A
to binary strings. Since it may be necessary to use marked letters, we can assume that the 20
letters of the UTM's alphabet from Section 5.5 (the 10 letters of A and their marked copies) are
available and are encoded into 5-bit binary strings. This allows us to view
encodings of Turing machines as binary strings but to speak of the encodings in terms of the
letters in the alphabet A.

5.7  Limits  on  Language  Acceptance 

A language L that is decidable (also called recursive) has an algorithm, a Turing machine
that halts on all inputs and accepts just those strings in L. A language for which there is a
Turing machine that accepts just those strings in L, possibly not halting on strings not in L,
is recursively enumerable. A language that is recursively enumerable but not decidable is
unsolvable.

We begin by describing some decidable languages and then exhibit a language, $\mathcal{L}_1$, that
is not recursively enumerable (no Turing machine exists to accept the strings in it) but whose
complement, $\mathcal{L}_2$, is recursively enumerable but not decidable; that is, $\mathcal{L}_2$ is unsolvable. We use
the language $\mathcal{L}_2$ to show that other languages, including the halting problem, are unsolvable.

5.7.1  Decidable  Languages 

Our first decidable problem is the language of pairs of regular expressions and strings such that
the regular expression describes a language containing the corresponding string:

$\mathcal{L}_{rx} = \{R, w \mid w \text{ is in the language described by the regular expression } R\}$

THEOREM 5.7.1 The language $\mathcal{L}_{rx}$ is decidable.

Proof To decide on a string R, w, use the method of Theorem 4.4.1 to construct an NFSM
$M_1$ that accepts the language described by R. Then invoke the method of Theorem 4.2.1
to construct a DFSM $M_2$ accepting the same language as $M_1$. The string w is given to $M_2$,
which accepts it if R can generate it and rejects it otherwise. This procedure decides $\mathcal{L}_{rx}$
because it halts on all strings R, w, whether in $\mathcal{L}_{rx}$ or not. ■
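The last two steps of this procedure can be sketched in Python by building the DFSM lazily: instead of constructing all of its states in advance, the program tracks the single set of NFSM states that the DFSM would be in after each input letter. The sketch assumes that the NFSM has already been obtained from R (that construction is not shown), that it has no transitions on the empty string, and that it is given as a dictionary. The name nfsm_accepts is hypothetical.

```python
def nfsm_accepts(delta, start, accepting, word):
    """Decide membership by tracking the set of states the NFSM could be in."""
    current = frozenset([start])
    for letter in word:
        # One DFSM step: the union of the NFSM successors of every current state.
        current = frozenset(q for p in current for q in delta.get((p, letter), ()))
    return not current.isdisjoint(accepting)


# An NFSM for the regular expression (0+1)*1: binary strings that end in 1.
delta = {
    ("a", "0"): {"a"},
    ("a", "1"): {"a", "b"},
}
print(nfsm_accepts(delta, "a", {"b"}, "0101"))
print(nfsm_accepts(delta, "a", {"b"}, "10"))
```

The loop runs once per input letter, so the procedure halts on every input, which is the property that makes the language decidable.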

As a second example, we show that the language of encodings of finite-state machines that
recognize the empty language is decidable. Here an FSM encoded as a Turing machine reads one input
letter from the tape per step and makes a state transition, halting when it reaches the blank letter.

THEOREM 5.7.2 The language $L = \{p(M) \mid M \text{ is a DFSM and } L(M) = \emptyset\}$ is decidable.

Proof  L(M)  is  not  empty  if  there  is  some  string  w  it  can  accept.  To  determine  if  there 
is  such  a  string,  we  use  a  TM  M'  that  executes  a  breadth-first  search  on  the  graph  of  the 
DFSM  M  that  is  provided  as  input  to  M' .  M'  first  marks  the  initial  state  of  M  and  then 
repeatedly  marks  any  state  that  has  not  been  marked  previously  and  can  be  reached  from  a 
marked  state  until  no  additional  states  can  be  marked.  This  process  terminates  because  M 
has  a  finite  number  of  states.  Finally,  M'  checks  to  see  if  there  is  a  marked  accepting  state 
that  can  be  reached  from  the  initial  state,  rejecting  the  input  p(M)  if  so  and  accepting  it  if 
not.  ■ 
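The marking procedure in this proof is a graph search, sketched below in Python. The representation is an assumption made for the example: the DFSM is given directly as a dictionary of transitions rather than as an encoding on a tape, and the function name dfsm_language_is_empty is hypothetical.

```python
from collections import deque


def dfsm_language_is_empty(delta, start, accepting):
    """Return True when no accepting state is reachable from the initial state."""
    successors = {}
    for (p, _letter), q in delta.items():
        successors.setdefault(p, set()).add(q)
    marked = {start}  # mark the initial state
    queue = deque([start])
    while queue:
        p = queue.popleft()
        for q in successors.get(p, ()):
            if q not in marked:  # mark each newly reached state once
                marked.add(q)
                queue.append(q)
    return marked.isdisjoint(accepting)


delta = {
    ("a", "0"): "a", ("a", "1"): "b",
    ("b", "0"): "b", ("b", "1"): "b",
    ("c", "0"): "c", ("c", "1"): "c",
}
print(dfsm_language_is_empty(delta, "a", {"c"}))  # state c cannot be reached from a
print(dfsm_language_is_empty(delta, "a", {"b"}))
```

Each state enters the queue at most once, which mirrors the termination argument in the proof.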

The  third  language  describes  context-free  grammars  generating  languages  that  are  empty. 
Here  we  encode  the  definition  of  a  context-free  grammar  G  as  a  string  p(G)  over  a  small 
alphabet. 

THEOREM 5.7.3 The language $L = \{p(G) \mid G \text{ is a CFG and } L(G) = \emptyset\}$ is decidable.

Proof We design a TM M' that, when given as input a description p(G) of a CFG G,
first marks all the terminals of the grammar and then scans all the rules of the grammar,
marking non-terminal symbols that can be replaced by some marked symbols. (If there is a
non-terminal A that is not marked and there is a rule $A \rightarrow BCD$ in which B, C, D have
already been marked, then the TM also marks A.) We repeat this procedure until no new
non-terminals can be marked. This process terminates because the grammar G has a finite
number of non-terminals. If S is not marked, we accept p(G). Otherwise, we reject p(G)
because it is possible to generate a string of terminals from S. ■
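The marking procedure of this proof can be written as a fixed-point loop. In the Python sketch below a grammar is represented, as an assumption for the example, by a list of rules (head, body) in which body is a tuple of symbols. The function name cfg_language_is_empty is hypothetical.

```python
def cfg_language_is_empty(rules, terminals, start):
    """Return True when the start symbol cannot derive any string of terminals."""
    marked = set(terminals)  # first mark all terminals
    changed = True
    while changed:  # rescan the rules until nothing new is marked
        changed = False
        for head, body in rules:
            if head not in marked and all(symbol in marked for symbol in body):
                marked.add(head)
                changed = True
    return start not in marked


terminals = {"a", "b"}
rules = [("S", ("A", "B")), ("A", ("a",)), ("B", ("B", "b"))]
print(cfg_language_is_empty(rules, terminals, "S"))
print(cfg_language_is_empty(rules + [("B", ("b",))], terminals, "S"))
```

In the first grammar B can only be rewritten in terms of itself, so neither B nor S is ever marked. Adding the rule that rewrites B as b lets both be marked.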

5.7.2  A  Language  That  Is  Not  Recursively  Enumerable 

Not  unexpectedly,  there  are  well-defined  languages  that  are  not  recursively  enumerable,  as  we 
show  in  this  section.  We  also  show  that  the  complement  of  a  decidable  language  is  decidable. 
This  allows  us  to  exhibit  a  language  that  is  recursively  enumerable  but  undecidable. 

Consider the language $\mathcal{L}_1$ defined below. It contains the ith binary input string if it is not
accepted by the ith Turing machine.

$\mathcal{L}_1 = \{w_i \mid w_i \text{ is not accepted by } M_i\}$

THEOREM 5.7.4 The language $\mathcal{L}_1$ is not recursively enumerable; that is, no Turing machine exists
that accepts exactly the strings in this language.

Figure 5.11 A table whose rows and columns are indexed by input strings and Turing
machines, respectively. Here $w_i$ is the ith input string and $p(M_j)$ is the encoding of the jth Turing
machine. The entry in row i, column j indicates whether or not $M_j$ accepts $w_i$. The language
$\mathcal{L}_1$ consists of input strings $w_j$ for which the entry in the jth row and jth column is reject.


Proof We use proof by contradiction; that is, we assume the existence of a TM $M_k$ that
accepts $\mathcal{L}_1$. If $w_k$ is in $\mathcal{L}_1$, then $M_k$ accepts it, contradicting the definition of $\mathcal{L}_1$. This
implies that $w_k$ is not in $\mathcal{L}_1$. On the other hand, if $w_k$ is not in $\mathcal{L}_1$, then it is not accepted
by $M_k$. It follows from the definition of $\mathcal{L}_1$ that $w_k$ is in $\mathcal{L}_1$. Thus, $w_k$ is in $\mathcal{L}_1$ if and only
if it is not in $\mathcal{L}_1$. We have a contradiction and no Turing machine accepts $\mathcal{L}_1$. ■

This proof uses diagonalization. (See Fig. 5.11.) In effect, we construct an infinite
two-dimensional matrix whose rows are indexed by input words and whose columns are indexed
by Turing machines. The entry in row i and column j of this matrix specifies whether or not
input word $w_i$ is accepted by $M_j$. The language $\mathcal{L}_1$ contains those words $w_j$ that $M_j$ does not accept,
that is, it contains row indices (words) for which the word "reject" is found on the diagonal.
If we assume that some TM, $M_k$, accepts $\mathcal{L}_1$, we have a problem because we cannot decide
whether or not $w_k$ is in $\mathcal{L}_1$. Diagonalization is effective in ruling out the possibility of solving
a computational problem but has limited usefulness on problems of bounded size.
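A finite fragment of the table makes the diagonal argument concrete. The Python sketch below uses a made-up 3 x 3 table of accept (True) and reject (False) entries. The table values and the function names are illustrative assumptions, and a finite table only illustrates the argument for the infinite one.

```python
def diagonal_set(table):
    """Indices i whose diagonal entry says that machine i does not accept word i."""
    return {i for i in range(len(table)) if not table[i][i]}


def accepted_by(table, j):
    """Indices of the words accepted by machine j (column j of the table)."""
    return {i for i in range(len(table)) if table[i][j]}


# table[i][j] is True when machine j accepts word i.
table = [
    [True, False, True],
    [False, False, True],
    [True, True, True],
]
d = diagonal_set(table)
print(d)
print([accepted_by(table, j) == d for j in range(len(table))])
```

Column j always disagrees with the diagonal set on index j, so no column can equal it, whatever values are placed in the table.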

5.7.3  Recursively  Enumerable  but  Not  Decidable  Languages 

We show the existence of a language that is recursively enumerable but not decidable. Our
approach is to show that the complement of a recursive language is recursive and then exhibit
a recursively enumerable language $\mathcal{L}_2$ whose complement $\mathcal{L}_1$ is not recursively enumerable:

$\mathcal{L}_2 = \{w_i \mid w_i \text{ is accepted by } M_i\}$

THEOREM  5.7.5  The  complement  of  a  decidable  language  is  decidable. 

Proof Let L be a recursive language accepted by a Turing machine $M_1$ that halts on all
input strings. Relabel the accepting halt state of $M_1$ as non-accepting and all non-accepting
halt states as accepting. This produces a machine $M_2$ that enters an accepting halt state only
when $M_1$ enters a non-accepting halt state and vice versa. We convert this non-standard
machine to standard form (having one accepting halt state) by adding a new accepting halt
state and making a transition to it from all accepting halt states. This new machine halts on
all inputs and accepts the complement of L. ■

THEOREM 5.7.6 The language $\mathcal{L}_2$ is recursively enumerable but not decidable.

Proof To establish the desired result it suffices to exhibit a Turing machine M that accepts
exactly the strings in $\mathcal{L}_2$, because the complement of $\mathcal{L}_2$ is $\mathcal{L}_1$, which is not recursively enumerable,
as shown above. (If $\mathcal{L}_2$ were decidable, then by Theorem 5.7.5 its complement $\mathcal{L}_1$ would be
decidable and hence recursively enumerable.)

Given a string x in $\mathcal{B}^*$, let M enumerate the input strings over the alphabet $\mathcal{B}$ of $\mathcal{L}_2$
until it finds x. Let x be the ith string where i is recorded in binary on one of M's tapes.
The strings over the alphabet A used for canonical encodings of Turing machines are
enumerated and tested to determine whether or not they are canonical encodings, as described
in Section 5.6. When the encoding $p(M_i)$ of the ith Turing machine is discovered, $M_i$ is
simulated with a universal Turing machine on the input string x. This universal machine
will halt and accept the string x if it is in $\mathcal{L}_2$. Thus, $\mathcal{L}_2$ is recursively enumerable. ■


5.8  Reducibility  and  Unsolvability 

In  this  section  we  show  that  there  are  many  languages  that  are  unsolvable  (undecidable).  In  the 
previous section we showed that the language $\mathcal{L}_2$ is unsolvable. To show that a new problem
is  unsolvable  we  use  reducibility:  we  assume  an  algorithm  A  exists  for  a  new  language  L  and 
then  show  that  we  can  use  A  to  obtain  an  algorithm  for  a  language  previously  shown  to  be 
unsolvable,  thereby  contradicting  the  assumption  that  algorithm  A  exists. 

We  begin  by  introducing  reducibility  and  then  give  examples  of  unsolvable  languages. 
Many  interesting  languages  are  unsolvable. 

5.8.1  Reducibility 

A new language $\mathcal{L}_{new}$ can often be shown unsolvable by assuming it is solvable and then
showing this implies that an older language $\mathcal{L}_{old}$ is solvable, where $\mathcal{L}_{old}$ has been previously
shown to be unsolvable. Since this contradicts the facts, the new language cannot be solvable.
This is one application of reducibility. The formal definition of reducibility is given below
and illustrated by Fig. 5.12.

DEFINITION 5.8.1 The language $L_1$ is reducible to the language $L_2$ if there is an algorithm
computing a total function $f : \mathcal{C}^* \mapsto \mathcal{D}^*$ that translates each string w over the alphabet $\mathcal{C}$ of $L_1$
into a string $z = f(w)$ over the alphabet $\mathcal{D}$ of $L_2$ such that $w \in L_1$ if and only if $z \in L_2$.

In this definition, testing for membership of a string w in $L_1$ is reduced to testing for
membership of a string z in $L_2$, where the latter problem is presumably a previously solved
problem. It is important to note that the latter problem is no easier than the former, even
though the use of the word "reduce" suggests that it is. Rather, reducibility establishes a link
between two problems with the expectation that the properties of one can be used to deduce
properties of the other. For example, reducibility is used to identify NP-complete problems.
(See Sections 3.9.3 and 8.7.)

Figure 5.12 The characteristic function $\phi_i$ of $L_i$, $i = 1, 2$, has value 1 on strings in $L_i$ and
0 otherwise. Because the language $L_1$ is reducible to the language $L_2$, there is a function f such
that for all x, $\phi_1(x) = \phi_2(f(x))$.


Reducibility is a fundamental idea that is formally introduced in Section 2.4 and used
throughout this book. Reductions of the type defined above are known as many-to-one
reductions. (See Section 8.7 for more on this subject.)

The following lemma is a tool to show that problems are unsolvable. We use the same
mechanism in Chapter 8 to classify languages by their use of time, space and other
computational resources.

LEMMA 5.8.1 Let $L_1$ be reducible to $L_2$. If $L_2$ is decidable, then $L_1$ is decidable. If $L_1$ is
unsolvable and $L_2$ is recursively enumerable, $L_2$ is also unsolvable.

Proof Let T be a Turing machine implementing the algorithm that translates strings over
the alphabet of $L_1$ to strings over the alphabet of $L_2$. If $L_2$ is decidable, there is a halting
Turing machine $M_2$ that accepts it. A multi-tape Turing machine $M_1$ that decides $L_1$ can
be constructed as follows: On input string w, $M_1$ invokes T to generate the string z, which
it then passes to $M_2$. If $M_2$ accepts z, $M_1$ accepts w. If $M_2$ rejects it, so does $M_1$. Thus,
$M_1$ decides $L_1$.

Suppose now that $L_1$ is unsolvable. If $L_2$ were decidable then, by the above
construction, $L_1$ would be decidable, contradicting the assumption that $L_1$ is unsolvable. Thus, $L_2$
cannot be decidable. ■
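The construction of $M_1$ in this proof is a composition of two procedures, which the following Python sketch makes explicit. The two languages used are toy examples chosen for illustration and are not taken from the text: $L_1$ is the set of binary strings with an even number of 1s and $L_2$ is the set of decimal numerals of even integers. All function names are hypothetical.

```python
def decider_from_reduction(translate, decide_l2):
    """Build a decider for L1 from a translation f and a decider for L2."""
    def decide_l1(w):
        z = translate(w)  # z = f(w), computed by the machine T
        return decide_l2(z)  # accept w exactly when M2 accepts z
    return decide_l1


def translate(w):
    """Map a binary string to the decimal numeral of its number of 1s."""
    return str(w.count("1"))


def decide_l2(z):
    """Decide whether the decimal numeral z denotes an even integer."""
    return int(z) % 2 == 0


decide_l1 = decider_from_reduction(translate, decide_l2)
print(decide_l1("1010"), decide_l1("111"))
```

Here w is in $L_1$ exactly when translate(w) is in $L_2$, which is the defining property of a reduction. The second half of the lemma reads the same composition in the contrapositive direction.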

The  power  of  this  lemma  will  be  apparent  in  the  next  section. 

5.8.2  Unsolvable  Problems 

In this section we examine six representative unsolvable problems. They range from the
classical halting problem to Rice's theorem.

We begin by considering the halting problem for Turing machines. The problem is to
determine for an arbitrary TM M and an arbitrary input string x whether M with input x
halts or not. We characterize this problem by the language $\mathcal{L}_H$ shown below. We show it is
unsolvable, that is, $\mathcal{L}_H$ is recursively enumerable but not decidable. No Turing machine exists
to decide this language.

$\mathcal{L}_H = \{p(M), w \mid M \text{ halts on input } w\}$

THEOREM 5.8.1 The language $\mathcal{L}_H$ is recursively enumerable but not decidable.

Proof To show that $\mathcal{L}_H$ is recursively enumerable, pass the encoding p(M) of the TM M
and the input string w to the universal Turing machine U of Section 5.5. This machine
simulates M and halts on the input w if and only if M halts on w. Thus, $\mathcal{L}_H$ is recursively
enumerable.

To show that $\mathcal{L}_H$ is undecidable, we assume that $\mathcal{L}_H$ is decidable by a Turing machine
$M_H$ and show a contradiction. Using $M_H$ we construct a Turing machine $M^*$ that decides
the language $\mathcal{L}^* = \{p(M), w \mid w \text{ is not accepted by } M\}$. $M^*$ simulates $M_H$ on p(M), w
to determine whether M halts or not on w. If $M_H$ says that M does not halt, $M^*$ accepts
w. If $M_H$ says that M does halt, $M^*$ simulates M on input string w and rejects w if M
accepts it and accepts w if M rejects it. Thus, if $\mathcal{L}_H$ is decidable, so is $\mathcal{L}^*$.

The procedures described in Section 5.6 can be used to design a second Turing machine $M'$
that determines for which integer i the input string w is lexicographically the ith string, $w_i$,
and also produces the description $p(M_i)$ of the ith Turing machine $M_i$.

To decide $\mathcal{L}_1$ we use $M'$ to translate an input string $w = w_i$ to the string $p(M_i), w_i$.
Given the presumed existence of $M^*$, we can decide $\mathcal{L}_1$ by deciding $\mathcal{L}^*$. However, by
Theorem 5.7.4, $\mathcal{L}_1$ is not decidable (it is not even recursively enumerable). Thus, $\mathcal{L}^*$ is not
decidable, which implies that $\mathcal{L}_H$ is also not decidable. ■

The second unsolvable problem we consider is the empty tape acceptance problem: given
a Turing machine M, we ask if we can tell whether it accepts the empty string. We reduce the
halting problem to it. (See Fig. 5.13.)

$\mathcal{L}_{ET} = \{p(M) \mid L(M) \text{ contains the empty string}\}$

THEOREM 5.8.2 The language $\mathcal{L}_{ET}$ is not decidable.

Proof To show that $\mathcal{L}_{ET}$ is not decidable, we assume that it is and derive a contradiction.
The contradiction is produced by assuming the existence of a TM $M_{ET}$ that decides $\mathcal{L}_{ET}$
and then showing that this implies the existence of a TM $M_H$ that decides $\mathcal{L}_H$.

Given an encoding p(M) for an arbitrary TM M and an arbitrary input w, the TM
$M_H$ constructs a TM T(M, w) that writes w on the tape when the tape is empty and
simulates M on w, halting if M halts. Thus, T(M, w) accepts the empty tape if M halts
on w. $M_H$ decides $\mathcal{L}_H$ by constructing an encoding of T(M, w) and passing it to $M_{ET}$.
(See Fig. 5.13.) The language accepted by T(M, w) includes the empty string if and only
if M halts on w. Thus, $M_H$ decides the halting problem, which as shown earlier cannot be
decided. ■

Figure 5.13 Schematic representation of the reduction from $\mathcal{L}_H$ to $\mathcal{L}_{ET}$.

The third unsolvable problem we consider is the empty set acceptance problem: Given a
Turing machine, we ask if we can tell if the language it accepts is empty. We reduce the halting
problem to this language.

$\mathcal{L}_{EL} = \{p(M) \mid L(M) = \emptyset\}$

THEOREM 5.8.3 The language $\mathcal{L}_{EL}$ is not decidable.

Proof We reduce $\mathcal{L}_H$ to $\mathcal{L}_{EL}$: we assume that $\mathcal{L}_{EL}$ is decidable by a TM $M_{EL}$ and then show
that a TM $M_H$ exists that decides $\mathcal{L}_H$, thereby establishing a contradiction.

Given an encoding p(M) for an arbitrary TM M and an arbitrary input w, the TM
$M_H$ constructs a TM T(M, w) that accepts the string placed on its tape if it is w and M
halts on it; otherwise it enters an infinite loop. $M_H$ can implement T(M, w) by entering an
infinite loop if its input string is not w and otherwise simulating M on w with a universal
Turing machine.

It follows that L(T(M, w)) is empty if M does not halt on w and contains w if it does
halt. Under the assumption that $M_{EL}$ decides $\mathcal{L}_{EL}$, $M_H$ can decide $\mathcal{L}_H$ by constructing
T(M, w) and passing it to $M_{EL}$, which accepts p(T(M, w)) if M does not halt on w and
rejects it if M does halt. Thus, $M_H$ decides $\mathcal{L}_H$, a contradiction. ■

The  fourth  problem  we  consider  is  the  regular  machine  recognition  problem.  In  this 
case  we  ask  if  a  Turing  machine  exists  that  can  decide  from  the  description  of  an  arbitrary 
Turing  machine  M  whether  the  language  accepted  by  M  is  regular  or  not: 

$\mathcal{L}_R = \{p(M) \mid L(M) \text{ is regular}\}$

THEOREM 5.8.4 The language $\mathcal{L}_R$ is not decidable.

Proof We assume that a TM $M_R$ exists to decide $\mathcal{L}_R$ and show that this implies the
existence of a TM $M_H$ that decides $\mathcal{L}_H$, a contradiction. Thus, $M_R$ cannot exist.

Given an encoding p(M) for an arbitrary TM M and an arbitrary input w, the TM
$M_H$ constructs a TM T(M, w) that scans its tape. If it finds a string in $\{0^n 1^n \mid n > 0\}$, it
accepts it; if not, T(M, w) erases the tape and simulates M on w, halting only if M halts
on w. Thus, T(M, w) accepts all strings in $\mathcal{B}^*$ if M halts on w but accepts only strings
in $\{0^n 1^n \mid n > 0\}$ otherwise. That is, T(M, w) accepts the regular language $\mathcal{B}^*$ if M halts
on w and accepts the context-free language $\{0^n 1^n \mid n > 0\}$, which is not regular, otherwise. Thus, $M_H$ can be
implemented by constructing T(M, w) and passing it to $M_R$, which is presumed to decide
$\mathcal{L}_R$. ■

The fifth problem generalizes the above result and is known as Rice’s theorem. It says that
no algorithm exists to determine from the description of a TM whether or not the language it
accepts falls into a given nonempty proper subset of the recursively enumerable languages.

Let RE be the set of recursively enumerable languages over $\mathcal{B}$. For each set $C$ that
is a proper subset of RE, define the following language:

$$\mathcal{L}_C = \{\rho(M) \mid L(M) \in C\}$$

Rice’s theorem says that, for all $C$ such that $C \neq \emptyset$ and $C \subset \text{RE}$, the
language $\mathcal{L}_C$ defined above is undecidable.

THEOREM 5.8.5 (Rice) Let $C \subset \text{RE}$, $C \neq \emptyset$. The language $\mathcal{L}_C$ is not decidable.

Proof To prove that $\mathcal{L}_C$ is not decidable, we assume that it is decidable by the TM
$M_C$ and show that this implies the existence of a TM $M_H$ that decides $\mathcal{L}_H$,
which has been shown previously not to exist. Thus, $M_C$ cannot exist.

We consider two cases, the first in which $\mathcal{B}^*$ is not in $C$ and the second in which
it is in $C$. In the first case, let $L$ be a language in $C$. In the second, let $L$ be a language in
$\text{RE} - C$. Since $C$ is a proper subset of RE and not empty, there is always a language $L$
such that one of $L$ and $\mathcal{B}^*$ is in $C$ and the other is in its complement
$\text{RE} - C$.

Given an encoding $\rho(M)$ for an arbitrary TM $M$ and an arbitrary input $w$, the TM
$M_H$ constructs a (four-tape) TM $T(M, w)$ that simulates two machines in parallel (by
alternately simulating one step of each machine). The first, $M_0$, uses a phrase-structure
grammar for $L$ to see if $T(M, w)$’s input string $x$ is in $L$; it holds $x$ on one tape, holds the
current choice inputs for the NDTM $M_L$ of Theorem 5.4.2 on a second, and uses a third
tape for the deterministic simulation of $M_L$. (See the comments following Theorem 5.4.2.)
$T(M, w)$ halts if $M_0$ generates $x$. The second TM writes $w$ on the fourth tape and
simulates $M$ on it. $T(M, w)$ halts if $M$ halts on $w$. Thus, $T(M, w)$ accepts the regular
language $\mathcal{B}^*$ if $M$ halts on $w$ and accepts $L$ otherwise. Thus, $M_H$ can be
implemented by constructing $T(M, w)$ and passing it to $M_C$, which is presumed to decide
$\mathcal{L}_C$. ■

Our last problem is the self-terminating machine problem. The question addressed is
whether a Turing machine $M$ given a description $\rho(M)$ of itself as input will halt or not. The
problem is defined by the following language. We give a direct proof that it is undecidable;
that is, we do not reduce some other problem to it.

$$\mathcal{L}_{ST} = \{\rho(M) \mid M \text{ is self-terminating}\}$$

THEOREM 5.8.6 The language $\mathcal{L}_{ST}$ is recursively enumerable but not decidable.

Proof To show that $\mathcal{L}_{ST}$ is recursively enumerable we exhibit a TM $T$ that accepts
strings in $\mathcal{L}_{ST}$. $T$ makes a copy of its input string $\rho(M)$ and simulates $M$ on
$\rho(M)$ by passing $(\rho(M), \rho(M))$ to a universal TM that halts and accepts $\rho(M)$ if it is
in $\mathcal{L}_{ST}$.

To show that $\mathcal{L}_{ST}$ is not decidable, we assume that it is and arrive at a contradiction.
Let $M_{ST}$ decide $\mathcal{L}_{ST}$. We design a TM $M^*$ that does the following: $M^*$
simulates $M_{ST}$ on the input string $w$. If $M_{ST}$ halts and accepts $w$, $M^*$ enters an
infinite loop. If $M_{ST}$ halts and rejects $w$, $M^*$ accepts $w$. ($M_{ST}$ halts on all inputs.)

The new machine $M^*$ is either self-terminating or it is not. If $M^*$ is self-terminating,
then on input $\rho(M^*)$, which is an encoding of itself, $M^*$ enters an infinite loop because
$M_{ST}$ detects that it is self-terminating. Thus, $M^*$ is not self-terminating, a contradiction.
On the other hand, if $M^*$ is not self-terminating, on input $\rho(M^*)$ it halts and accepts
$\rho(M^*)$ because $M_{ST}$ detects that it is not self-terminating and enters the rejecting halt
state. But this contradicts the assumption that $M^*$ is not self-terminating. Since we arrive at
a contradiction in both cases, the assumption that $\mathcal{L}_{ST}$ is decidable must be false. ■
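
The diagonal construction in this proof can be mimicked in a few lines. This is a minimal sketch under assumptions that are not part of the text: machines are modeled as Python callables that return exactly when they halt, a callable serves as its own description, and `make_m_star` and `refute` are hypothetical names. The `decider` argument stands for the assumed machine $M_{ST}$.

```python
from typing import Callable

# Assumption: a machine is a callable that returns iff it halts, and the
# callable object itself plays the role of its encoding.
Machine = Callable[[object], object]
Decider = Callable[[Machine], bool]   # claims: "m halts on its own description"


def make_m_star(decider: Decider) -> Machine:
    """Build M*: loop forever if the decider accepts the input, halt otherwise."""
    def m_star(x):
        if decider(x):
            while True:      # decider says x is self-terminating: do the opposite
                pass
        return "accept"      # decider says x is not self-terminating: halt
    return m_star


def refute(decider: Decider) -> str:
    """Show that the decider's verdict on M* is wrong, whichever verdict it gives."""
    m_star = make_m_star(decider)
    if decider(m_star):
        # m_star(m_star) would loop forever, so it is deliberately not run here.
        return "decider accepts M*, but M* loops on its own description"
    m_star(m_star)           # halts, because the decider rejected M*
    return "decider rejects M*, but M* halts on its own description"
```

Whatever total function is supplied as `decider`, `refute` exhibits an input on which it answers incorrectly, which is the content of the two cases in the proof.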

5.9  Functions  Computed  by  Turing  Machines 

In this section we introduce the partial recursive functions, a family of functions in which
each function is constructed from three base function types, successor, predecessor, and
projection, and three operations on functions, composition, primitive recursion, and
minimalization. Although we do not have the space to show this, the functions computed by
Turing machines are exactly the partial recursive functions. In this section, we show one half
of this result, namely, that every partial recursive function can be encoded as a RAM program
(see Section 3.4.3) that can be executed by Turing machines.

We begin with the primitive recursive functions and then describe the partial recursive
functions. We then show that partial recursive functions can be realized by RAM programs.

5.9.1  Primitive  Recursive  Functions 

Let $\mathbb{N} = \{0, 1, 2, 3, \ldots\}$ be the set of non-negative integers. The partial recursive
functions, $f : \mathbb{N}^n \mapsto \mathbb{N}^m$, map $n$-tuples of integers over $\mathbb{N}$ to
$m$-tuples of integers in $\mathbb{N}$ for arbitrary $n$ and $m$. Partial recursive functions may
be partial functions. They are constructed from three base function types, the successor
function $S : \mathbb{N} \mapsto \mathbb{N}$, where $S(x) = x + 1$, the predecessor function
$P : \mathbb{N} \mapsto \mathbb{N}$, where $P(x)$ returns either 0 if $x = 0$ or the integer one less
than $x$, and the projection functions $U_j^n : \mathbb{N}^n \mapsto \mathbb{N}$, $1 \leq j \leq n$,
where $U_j^n(x_1, x_2, \ldots, x_n) = x_j$. These base functions are combined using a finite
number of applications of function composition, primitive recursion, and minimalization.

Function composition is studied in Chapters 2 and 6. A function $f : \mathbb{N}^n \mapsto \mathbb{N}$
of $n$ arguments is defined by the composition of a function $g : \mathbb{N}^m \mapsto \mathbb{N}$ of
$m$ arguments with $m$ functions $f_1 : \mathbb{N}^n \mapsto \mathbb{N}$,
$f_2 : \mathbb{N}^n \mapsto \mathbb{N}$, $\ldots$, $f_m : \mathbb{N}^n \mapsto \mathbb{N}$, each of $n$
arguments, as follows:

$$f(x_1, x_2, \ldots, x_n) = g(f_1(x_1, x_2, \ldots, x_n), \ldots, f_m(x_1, x_2, \ldots, x_n))$$

A function $f : \mathbb{N}^{n+1} \mapsto \mathbb{N}$ of $n + 1$ arguments is defined by primitive
recursion from a function $g : \mathbb{N}^n \mapsto \mathbb{N}$ of $n$ arguments and a function
$h : \mathbb{N}^{n+2} \mapsto \mathbb{N}$ of $n + 2$ arguments if and only if for all values of
$x_1, x_2, \ldots, x_n$ and $y$ in $\mathbb{N}$:

$$f(x_1, x_2, \ldots, x_n, 0) = g(x_1, x_2, \ldots, x_n)$$
$$f(x_1, x_2, \ldots, x_n, y + 1) = h(x_1, x_2, \ldots, x_n, y, f(x_1, x_2, \ldots, x_n, y))$$

In the above definition if $n = 0$, we adopt the convention that $g$ is a constant, so that
$f(0)$ is that constant. Thus, $f(x_1, x_2, \ldots, x_n, k)$ is defined recursively in terms of $h$
and itself with $k$ replaced by $k - 1$ unless $k = 0$.

DEFINITION 5.9.1 The class of primitive recursive functions is the smallest class of functions
that contains the base functions and is closed under composition and primitive recursion.

Many functions of interest are primitive recursive. Among these is the zero function
$Z : \mathbb{N} \mapsto \mathbb{N}$, where $Z(x) = 0$. It is defined by primitive recursion by
$Z(0) = 0$ and

$$Z(x + 1) = U_2^2(x, Z(x))$$

Other important primitive recursive functions are addition, subtraction, multiplication, and
division, as we now show. Let $f_{add} : \mathbb{N}^2 \mapsto \mathbb{N}$,
$f_{sub} : \mathbb{N}^2 \mapsto \mathbb{N}$, $f_{mult} : \mathbb{N}^2 \mapsto \mathbb{N}$, and
$f_{div} : \mathbb{N}^2 \mapsto \mathbb{N}$ denote integer addition, subtraction, multiplication,
and division.

For the integer addition function $f_{add}$ introduce the function
$h_1 : \mathbb{N}^3 \mapsto \mathbb{N}$ on three arguments, where $h_1$ is defined below in terms
of the successor and projection functions:

$$h_1(x_1, x_2, x_3) = S(U_3^3(x_1, x_2, x_3))$$

Then, $h_1(x_1, x_2, x_3) = x_3 + 1$. Now define $f_{add}(x, y)$ using primitive recursion, as follows:

$$f_{add}(x, 0) = U_1^1(x)$$
$$f_{add}(x, y + 1) = h_1(x, y, f_{add}(x, y))$$

The role of $h_1$ is to carry the values of $x$ and $y$ from one recursive invocation to another. To
determine the value of $f_{add}(x, y)$ from this definition, if $y = 0$, $f_{add}(x, y) = x$. If $y > 0$,
$f_{add}(x, y) = h_1(x, y - 1, f_{add}(x, y - 1))$. This in turn causes other recursive invocations of
$f_{add}$. The infix notation $+$ is used for $f_{add}$; that is, $f_{add}(x, y) = x + y$.

Because the primitive recursive functions are defined over the non-negative integers, the
subtraction function $f_{sub}(x, y)$ must return the value 0 if $y$ is larger than $x$, an operation
called proper subtraction. (Its infix notation is $\dot{-}$ and we write
$f_{sub}(x, y) = x \mathbin{\dot{-}} y$.) It is defined as follows:

$$f_{sub}(x, 0) = U_1^1(x)$$
$$f_{sub}(x, y + 1) = U_3^3(x, y, P(f_{sub}(x, y)))$$

The value of $f_{sub}(x, y)$ is $x$ if $y = 0$ and is the predecessor of $f_{sub}(x, y - 1)$ otherwise.

The integer multiplication function, $f_{mult}$, is defined in terms of the function
$h_2 : \mathbb{N}^3 \mapsto \mathbb{N}$:

$$h_2(x_1, x_2, x_3) = f_{add}(U_1^3(x_1, x_2, x_3), U_3^3(x_1, x_2, x_3))$$

Using primitive recursion, we have

$$f_{mult}(x, 0) = Z(x)$$
$$f_{mult}(x, y + 1) = h_2(x, y, f_{mult}(x, y))$$

The value of $f_{mult}(x, y)$ is zero if $y = 0$ and otherwise is the sum of $y$ copies of $x$.
To see this, note that the value of $h_2$ is the sum of its first and third arguments, $x$ and
$f_{mult}(x, y)$. On each invocation of primitive recursion the value of $y$ is decremented by 1
until the value 0 is reached. The definition of the division function is left as Problem 5.26.
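
The definitions of $Z$, $f_{add}$, $f_{sub}$, and $f_{mult}$ can be checked mechanically. The following is a minimal sketch, assuming Python functions on non-negative integers as a stand-in for the formal objects; the helper names `U`, `compose`, and `prim_rec` are hypothetical, and the arity of each projection is left implicit rather than enforced.

```python
def S(x):
    """Successor."""
    return x + 1


def P(x):
    """Predecessor, with P(0) = 0."""
    return x - 1 if x > 0 else 0


def U(j):
    """Projection onto the j-th argument (1-indexed); arity is implicit."""
    return lambda *xs: xs[j - 1]


def compose(g, *fs):
    """f(x) = g(f_1(x), ..., f_m(x))."""
    return lambda *xs: g(*(f(*xs) for f in fs))


def prim_rec(g, h):
    """f(x, 0) = g(x);  f(x, y + 1) = h(x, y, f(x, y))."""
    def f(*args):
        *xs, y = args
        if y == 0:
            return g(*xs)
        return h(*xs, y - 1, f(*xs, y - 1))
    return f


Z = prim_rec(lambda: 0, U(2))                  # Z(0) = 0, Z(x+1) = U_2^2(x, Z(x))
h1 = compose(S, U(3))                          # h_1(x1, x2, x3) = x3 + 1
f_add = prim_rec(U(1), h1)
f_sub = prim_rec(U(1), compose(P, U(3)))       # predecessor of f_sub(x, y)
h2 = compose(lambda a, b: f_add(a, b), U(1), U(3))
f_mult = prim_rec(Z, h2)

print(f_add(3, 4), f_sub(5, 2), f_sub(2, 5), f_mult(3, 4))
```

The recursion mirrors the defining equations directly, so it is only suitable for small arguments; its purpose is to confirm that the equations compute what the text claims.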

Define the function $f_{sign} : \mathbb{N} \mapsto \mathbb{N}$ so that $f_{sign}(0) = 0$ and
$f_{sign}(x + 1) = 1$. To show that $f_{sign}$ is primitive recursive it suffices to invoke the
projection operator formally. A function with value 0 or 1 is called a predicate.

5.9.2  Partial  Recursive  Functions 

The partial recursive functions are obtained by extending the primitive recursive functions to
include minimalization. Minimalization defines a function $f : \mathbb{N}^n \mapsto \mathbb{N}$ in
terms of a second function $g : \mathbb{N}^{n+1} \mapsto \mathbb{N}$ by letting $f(x)$ be the
smallest integer $y \in \mathbb{N}$ such that $g(x, y) = 0$ and $g(x, z)$ is defined for all $z < y$,
$z \in \mathbb{N}$. Note that if no such $y$ exists, or if $g(x, z)$ is not defined for all $z < y$,
then $f(x)$ is not defined. Thus, minimalization can result in partial functions.
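
Minimalization amounts to an unbounded search, which the sketch below makes explicit. It is an illustration only: `minimalize` and `isqrt` are hypothetical names, and a function value that is "not defined" is modeled by a call that never returns.

```python
def minimalize(g):
    """Return f with f(x) = least y such that g(x, y) == 0.

    If no such y exists, the returned function never terminates, which is
    how the partial nature of minimalization shows up in this model.
    """
    def f(*xs):
        y = 0
        while g(*xs, y) != 0:   # unbounded search over y = 0, 1, 2, ...
            y += 1
        return y
    return f


# Example (an assumption chosen for illustration): the integer square root
# of x is the least y with (y + 1)^2 > x.
isqrt = minimalize(lambda x, y: 0 if (y + 1) * (y + 1) > x else 1)

print(isqrt(10), isqrt(16))
```

Applying `minimalize` to a function that is never zero, such as `lambda x, y: 1`, gives a function that loops on every input, which corresponds to a partial function with empty domain.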

DEFINITION 5.9.2 The set of partial recursive functions is the smallest set of functions
containing the base functions that is closed under composition, primitive recursion, and
minimalization.

A partial recursive function that is defined for all points in its domain is called a recursive
function.

5.9.3 Partial Recursive Functions are RAM-Computable

There is a nice correspondence between RAM programs and partial recursive functions. The
straight-line programs result from applying composition to the base functions. Adding
primitive recursion corresponds to adding for-loops whereas adding minimalization corresponds
to adding while loops.

It is not difficult to see that every partial recursive function can be described by a program
in the RAM assembly language of Section 3.4.3. For example, to compute the zero function,
$Z(x)$, it suffices for a RAM program to clear register $R_1$. To compute the successor function,
$S(x)$, it suffices to increment register $R_1$. Similarly, to compute the projection function
$U_j^n$, one need only load register $R_1$ with the contents of register $R_j$. Function
composition is straightforward: one need only ensure that the functions $f_j$, $1 \leq j \leq m$,
deposit their values in registers that are accessed by $g$. Similar constructions are possible for
primitive recursion and minimalization. (See Problems 5.29, 5.30, and 5.31.)
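
The correspondence between primitive recursion and for-loops can be illustrated in a high-level language. The sketch below is not a RAM program and does not answer Problem 5.30; it is a Python analogue, with the hypothetical name `prim_rec_loop`, that evaluates the defining equations bottom-up with a loop whose iteration count is fixed before the loop starts.

```python
def prim_rec_loop(g, h):
    """Evaluate f(x, 0) = g(x), f(x, y + 1) = h(x, y, f(x, y)) with a for-loop."""
    def f(*args):
        *xs, y = args
        acc = g(*xs)              # acc holds f(x, 0)
        for i in range(y):        # bounded loop: exactly y iterations
            acc = h(*xs, i, acc)  # acc becomes f(x, i + 1)
        return acc
    return f


add = prim_rec_loop(lambda x: x, lambda x, y, r: r + 1)
mult = prim_rec_loop(lambda x: 0, lambda x, y, r: add(x, r))

print(add(3, 4), mult(3, 4))
```

Because the loop bound `y` is known on entry, such a loop always terminates when `g` and `h` do, in contrast to the while loop needed for minimalization.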


Problems 

THE  STANDARD  TURING  MACHINE  MODEL 

5.1 Show that the standard Turing machine model of Section 5.1 and the model of
Section 3.7 are equivalent in that one can simulate the other.

PROGRAMMING  THE  TURING  MACHINE 

5.2  Describe  a  Turing  machine  that  generates  the  binary  strings  in  lexicographical  order. 
The  first  few  strings  in  this  ordering  are  0,  1,  00,  01,  10,  11,  000,  001,  ... . 

5.3 Describe a Turing machine recognizing $\{x^i y^j x^k \mid i, j, k \geq 1 \text{ and } k = i \cdot j\}$.

5.4 Describe a Turing machine that computes the function whose value on input is
$c^k$, where $k = i \cdot j$.

5.5 Describe a Turing machine that accepts the string $(u, v)$ if $u$ is a substring of $v$.

5.6 The element distinctness language, $L_{ed}$, consists of binary strings no two of which
are the same; that is, $L_{ed} = \{2w_1 2 \ldots 2w_k 2 \mid w_i \in \mathcal{B}^* \text{ and } w_i \neq w_j, \text{ for } i \neq j\}$.
Describe a Turing machine that accepts this language.

EXTENSIONS  TO  THE  STANDARD  TURING  MACHINE  MODEL 

5.7  Given  a  Turing  machine  with  a  double-ended  tape,  show  how  it  can  be  simulated  by 
one  with  a  single-ended  tape. 

5.8 Show equivalence between the standard Turing machine and the one-tape
double-headed Turing machine with two heads that can move independently on its one tape.

5.9  Show  that  a  pushdown  automaton  with  two  pushdown  tapes  is  equivalent  to  a  Turing 
machine. 

5.10 Figure 5.14 shows a representation of a Turing machine with a two-dimensional tape
whose head can move one step vertically or horizontally. Give a complete definition of
a two-dimensional TM and sketch a proof that it can be simulated by a standard TM.

Figure 5.14 A schematic representation of a two-dimensional Turing machine.

5.11 By analogy with the construction given in Section 3.9.7, show that every deterministic
$T$-step multi-tape Turing machine computation can be simulated on a two-tape Turing
machine in $O(T \log T)$ steps.

PHRASE-STRUCTURE  LANGUAGES  AND  TURING  MACHINES 

5.12 Give a detailed design of a Turing machine recognizing $\{a^n b^n c^n \mid n \geq 1\}$.

5.13 Use the method of Theorem 5.4.1 to construct a phrase-structure grammar generating
$\{a^n b^n c^n \mid n \geq 1\}$.

5.14 Design a Turing machine recognizing the language $\{0^{2^i} \mid i \geq 1\}$.

UNIVERSAL  TURING  MACHINES 

5.15  Using  the  description  of  Section  5.5,  give  a  complete  description  of  a  universal  Turing 
machine. 

5.16  Construct  a  universal  TM  that  has  only  two  non-accepting  states. 

DECIDABLE  PROBLEMS 

5.17  Show  that  the  following  languages  are  decidable: 

a) $L = \{\rho(M), w \mid M \text{ is a DFSM that accepts the input string } w\}$

b) $L = \{\rho(M) \mid M \text{ is a DFSM and } L(M) \text{ is infinite}\}$

5.18 The symmetric difference between sets $A$ and $B$ is defined by $(A - B) \cup (B - A)$,
where $A - B = A \cap \overline{B}$. Use the symmetric difference to show that the following
language is decidable:

$$\mathcal{L}_{EQ\_FSM} = \{\rho(M_1), \rho(M_2) \mid M_1 \text{ and } M_2 \text{ are FSMs recognizing the same language}\}$$

5.19 Show that the following language is decidable:

$$L = \{\rho(G), w \mid \rho(G) \text{ encodes a CFG } G \text{ that generates } w\}$$

Hint: How long is a derivation of $w$ if $G$ is in Chomsky normal form?

5.20 Show that the following language is decidable:

$$L = \{\rho(G) \mid \rho(G) \text{ encodes a CFG } G \text{ for which } L(G) \neq \emptyset\}$$

5.21 Let $L_1, L_2 \in \mathbf{P}$ where $\mathbf{P}$ is the class of polynomial-time problems (see
Definition 3.7.2). Show that the following statements hold:

a) $L_1 \cup L_2 \in \mathbf{P}$

b) $L_1 L_2 \in \mathbf{P}$, where $L_1 L_2$ is the concatenation of $L_1$ and $L_2$

c) $\overline{L_1} \in \mathbf{P}$

5.22 Let $L_1 \in \mathbf{P}$. Show that $L_1^* \in \mathbf{P}$.

Hint: Try using dynamic programming, the algorithmic concept illustrated by the
parsing algorithm of Theorem 4.11.2.

UNSOLVABLE  PROBLEMS 

5.23  Show  that  the  problem  of  determining  whether  an  arbitrary  TM  starting  with  a  blank 
tape  will  ever  halt  is  unsolvable. 

5.24  Show  that  the  following  language  is  undecidable: 

$$L_{EQ} = \{\rho(M_1), \rho(M_2) \mid L(M_1) = L(M_2)\}$$

5.25  Determine  which  of  the  following  problems  are  solvable  and  unsolvable.  Defend  your 
conclusions. 

a) $\{\rho(M), w, p \mid M \text{ reaches state } p \text{ on input } w \text{ from its initial state}\}$

b) $\{\rho(M), p \mid \text{there is a configuration } [u_1 \ldots u_m q v_1 \ldots v_n] \text{ yielding a configuration containing state } p\}$

c) $\{\rho(M), a \mid M \text{ writes character } a \text{ when started on the empty tape}\}$

d) $\{\rho(M) \mid M \text{ writes a non-blank character when started on the empty tape}\}$

e) $\{\rho(M), w \mid \text{on input } w, M \text{ moves its head to the left}\}$

FUNCTIONS  COMPUTED  BY  TURING  MACHINES 

5.26 Define the integer division function $f_{div} : \mathbb{N}^2 \mapsto \mathbb{N}$ using primitive recursion.

5.27 Show that the function $f_{remain} : \mathbb{N}^2 \mapsto \mathbb{N}$ that provides the remainder
of $x$ after division by $y$ is a primitive recursive function.

5.28 Show that the factorial function $x!$ is primitive recursive.

5.29  Write  a  RAM  program  (see  Section  3.4.3)  to  realize  the  composition  operation. 

5.30  Write  a  RAM  program  (see  Section  3.4.3)  to  realize  the  primitive  recursion  operation. 

5.31 Write a RAM program (see Section 3.4.3) to realize the minimalization operation.

Chapter Notes

Alan Turing introduced the Turing machine, gave an example of a universal machine, and
demonstrated the unsolvability of the halting problem in [338]. A similar model was
independently developed by Post [255]. Chomsky [69] demonstrated the equivalence of
phrase-structure languages. Rice’s theorem is presented in [280].

Church gave a formal model of computation in [72]. The equivalence between the partial
recursive functions and the Turing computable functions was shown by Kleene [168].

For a more extensive introduction to Turing machines, see the books by Hopcroft and
Ullman [141] and Lewis and Papadimitriou [200].


Algebraic and Combinatorial Circuits


Algebraic circuits combine operations drawn from an algebraic system. In this chapter we
develop algebraic and combinatorial circuits for a variety of generally non-Boolean problems,
including multiplication and inversion of matrices, convolution, the discrete Fourier transform,
and sorting networks. These problems are used primarily to illustrate concepts developed in
later chapters, so that this chapter may be used for reference when studying those chapters.

For each of the problems examined here the natural algorithms are straight-line and the
graphs are directed and acyclic; that is, they are circuits. Not only are straight-line algorithms
the ones typically used for these problems, but in some cases they are the best possible.

The quality of the circuits developed here is measured by circuit size, the number of circuit
operations, and circuit depth, the length of the longest path between input and output
vertices. Circuit size is a measure of the work necessary to execute the corresponding straight-line
program. Circuit depth is a measure of the minimal time needed for a problem on a parallel
machine.

For some problems, such as matrix inversion, we give serial (large-depth) as well as
parallel (small-depth) circuits. The parallel circuits generally require considerably more circuit
elements than the corresponding serial circuits.

6.1 Straight-Line Programs

Straight-line programs (SLP) are defined in Section 2.2. Each SLP step is an input,
computation, or output step. The notation ($s$ READ $x$) indicates that the $s$th step is an input
step on which the value $x$ is read. The notation ($s$ OUTPUT $i$) indicates that the result of the
$i$th step is to be provided as output. Finally, the notation ($s$ OP $i \ldots k$) indicates that the
$s$th step computes the value of the operator OP on the results generated at steps $i, \ldots, k$.
We require that $s > i, \ldots, k$ so that the result produced at step $s$ depends only on the results
produced at earlier steps. In this chapter we consider SLPs in which the inputs and operators
have values over a set $A$ that is generally not binary. Thus, the circuits considered here are
generally not logic circuits. The basis $\Omega$ for an SLP is the set of operators it uses. A circuit
is the graph of a straight-line program. By its nature this graph is directed and acyclic.

An example of a straight-line program that computes the fast Fourier transform (FFT)
on four inputs is given below. (The FFT is introduced in Section 6.7.3.) Here the function
$f_{+,\alpha}(a, b) = a + b\alpha$ where $\alpha$ is a power of a constant $\omega$ that is a principal
$n$th root of unity of a commutative ring $\mathcal{R}$. (See Section 6.7.1.) The arguments $a$ and
$b$ are variables with values in $\mathcal{R}$.

The graph of the above SLP is the familiar FFT butterfly graph shown in Fig. 6.1.
Assignment statements are associated with vertices of in-degree zero and operator statements are
associated with other vertices. We attach the name of the operator or variable associated with
each step to the corresponding vertex in the graph. We often suppress the unique indices of
vertices, although they are retained in Fig. 6.1.


Figure 6.1 The FFT butterfly graph on four inputs.

The function $g_s$ is associated with the $s$th step. The identity function with value $v$ is
associated with the assignment statement ($r$ READ $v$). Associated with the computation step
($s$ OP $i \ldots k$) is the function $g_s = \text{OP}(g_i, \ldots, g_k)$, where $g_i, \ldots, g_k$ are
the functions computed at the steps on which the $s$th step depends. If a straight-line program
has $n$ inputs and $m$ outputs, it computes a function $f : A^n \mapsto A^m$. If
$s_1, s_2, \ldots, s_m$ are the output steps, then $f = (g_{s_1}, g_{s_2}, \ldots, g_{s_m})$. The
function computed by a circuit is the function computed by the corresponding straight-line
program.

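In the example above, $g_{11} = f_{+,\omega^2}(g_5, g_7) = g_5 + g_7\omega^2$ where
$g_5 = f_{+,\omega^0}(g_1, g_2) = a_0 + a_2\omega^0 = a_0 + a_2$ and
$g_7 = f_{+,\omega^0}(g_3, g_4) = a_1 + a_3\omega^0 = a_1 + a_3$. Thus,

$$g_{11} = a_0 + a_1\omega^2 + a_2 + a_3\omega^2$$

which is the value of the polynomial $p(x)$ at $x = \omega^2$ when $\omega^4 = 1$:

$$p(x) = a_0 + a_1 x + a_2 x^2 + a_3 x^3$$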

The size of a circuit is the number of operator statements it contains. Its depth is the
length of (number of edges on) the longest path from an input to an output vertex. The basis
$\Omega$ is the set of operators used in the circuit. The size and depth of the smallest and
shallowest circuits for a function $f$ over the basis $\Omega$ are denoted $C_\Omega(f)$ and
$D_\Omega(f)$, respectively. In this chapter we derive upper bounds on the size and depth of
circuits.
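
The notions of SLP evaluation, size, and depth can be made concrete with a small interpreter. The sketch below rests on several assumptions that are not taken from the text: steps are encoded as Python tuples, the ring is the integers modulo 5 with $\omega = 2$ (so that $\omega^4 = 1$), and the four-input program `FFT4` is a reconstruction of the standard butterfly in which only the steps $g_5$ and $g_7$ and their combination with $\omega^2$ follow the worked example; the remaining steps are assumed.

```python
MOD = 5      # assumption: the commutative ring is the integers modulo 5
OMEGA = 2    # assumption: 2 has order 4 modulo 5, since 2**4 = 16 = 1 (mod 5)


def f_plus(e):
    """Operator f_{+,alpha}(a, b) = a + b*alpha, with alpha = OMEGA**e in the ring."""
    alpha = pow(OMEGA, e, MOD)
    return lambda a, b: (a + b * alpha) % MOD


# Each step is (s, kind, payload): READ carries an input name, OP carries
# (operator, argument steps), and OUTPUT carries the step whose result is emitted.
FFT4 = [
    (1, "READ", "a0"), (2, "READ", "a2"), (3, "READ", "a1"), (4, "READ", "a3"),
    (5, "OP", (f_plus(0), (1, 2))), (6, "OP", (f_plus(2), (1, 2))),
    (7, "OP", (f_plus(0), (3, 4))), (8, "OP", (f_plus(2), (3, 4))),
    (9, "OP", (f_plus(0), (5, 7))), (10, "OP", (f_plus(1), (6, 8))),
    (11, "OP", (f_plus(2), (5, 7))), (12, "OP", (f_plus(3), (6, 8))),
    (13, "OUTPUT", 9), (14, "OUTPUT", 10), (15, "OUTPUT", 11), (16, "OUTPUT", 12),
]


def run_slp(program, inputs):
    """Evaluate an SLP; return (outputs, size, depth)."""
    value, depth = {}, {}
    outputs, size, out_depth = [], 0, 0
    for s, kind, payload in program:
        if kind == "READ":
            value[s], depth[s] = inputs[payload] % MOD, 0
        elif kind == "OP":
            op, args = payload
            assert all(i < s for i in args), "a step may use only earlier results"
            value[s] = op(*(value[i] for i in args))
            depth[s] = 1 + max(depth[i] for i in args)
            size += 1                      # size counts operator steps only
        else:                              # OUTPUT
            outputs.append(value[payload])
            out_depth = max(out_depth, depth[payload])
    return outputs, size, out_depth


outs, size, depth = run_slp(FFT4, {"a0": 1, "a1": 2, "a2": 3, "a3": 4})
print(outs, size, depth)
```

Under these assumptions the four outputs are the values of $p(x) = a_0 + a_1 x + a_2 x^2 + a_3 x^3$ at $x = \omega^0, \omega^1, \omega^2, \omega^3$, and the interpreter reports eight operator steps and a longest input-to-output path of two edges.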

6.2  Mathematical  Preliminaries 

In  this  section  we  introduce  rings,  fields  and  matrices,  concepts  widely  used  in  this  chapter. 

6.2.1  Rings  and  Fields 

Rings and fields are algebraic systems that consist of a set with two special elements, 0 and 1,
and two operations called addition and multiplication that obey a small set of rules.

DEFINITION 6.2.1 A ring $\mathcal{R}$ is a five-tuple $(R, +, *, 0, 1)$, where $R$ is closed under
addition $+$ and multiplication $*$ (that is, $+ : R^2 \mapsto R$ and $* : R^2 \mapsto R$) and $+$
and $*$ are associative (for all $a, b, c \in R$, $a + (b + c) = (a + b) + c$ and
$a * (b * c) = (a * b) * c$). Also, $0, 1 \in R$, where 0 is the identity under addition (for all
$a \in R$, $a + 0 = 0 + a = a$) and 1 is the identity under multiplication (for all $a \in R$,
$a * 1 = 1 * a = a$). In addition, 0 is an annihilator under multiplication (for all $a \in R$,
$a * 0 = 0 * a = 0$). Every element of $R$ has an additive inverse (for all $a \in R$, there exists
an element $-a$ such that $(-a) + a = a + (-a) = 0$). Finally, addition is commutative (for all
$a, b \in R$, $a + b = b + a$) and multiplication distributes over addition (for all
$a, b, c \in R$, $a * (b + c) = (a * b) + (a * c)$ and $(b + c) * a = (b * a) + (c * a)$).
A ring is commutative if multiplication is commutative (for all $a, b \in R$, $a * b = b * a$).
A field is a commutative ring in which each element other than 0 has a multiplicative inverse
(for all $a \in R$, $a \neq 0$, there exists an element $a^{-1}$ such that $a * a^{-1} = 1$).

Let $\mathbb{Z}$ be the set of all integers (positive, negative, and zero) and let $+$ and $*$
denote integer addition and multiplication. Then $(\mathbb{Z}, +, *, 0, 1)$ is a commutative ring.
(See Problem 6.1.) Similarly, the system $(\{0, 1\}, +, *, 0, 1)$, where $+$ is addition modulo 2
(for all $a, b \in \{0, 1\}$, $a + b$ is the remainder of the integer sum after division by 2, that is,
the EXCLUSIVE OR operation) and $*$ is the AND operation, is a commutative ring, as the
reader can show. A third commutative ring is the integers modulo $p$ together with the
operations of addition and multiplication modulo $p$. (See Problem 6.2.) The ring of matrices
introduced in the next section is not commutative. Some important commutative rings are
introduced in Section 6.7.1.
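
For a small finite system the ring axioms can be verified exhaustively. The following sketch, with the hypothetical names `is_commutative_ring` and `is_field`, checks the axioms of Definition 6.2.1 for the integers modulo $n$ by brute force; it assumes $n \geq 2$ and is meant as a sanity check, not as a proof technique for infinite rings such as the integers.

```python
from itertools import product


def is_commutative_ring(n: int) -> bool:
    """Brute-force check of the ring axioms for ({0,...,n-1}, + mod n, * mod n, 0, 1)."""
    elems = range(n)

    def add(a, b):
        return (a + b) % n

    def mul(a, b):
        return (a * b) % n

    for a in elems:
        if add(a, 0) != a or mul(a, 1) != a or mul(a, 0) != 0:
            return False                      # identities and annihilator
        if not any(add(a, b) == 0 for b in elems):
            return False                      # additive inverse
    for a, b in product(elems, repeat=2):
        if add(a, b) != add(b, a) or mul(a, b) != mul(b, a):
            return False                      # commutativity of + and *
    for a, b, c in product(elems, repeat=3):
        if add(a, add(b, c)) != add(add(a, b), c):
            return False                      # + is associative
        if mul(a, mul(b, c)) != mul(mul(a, b), c):
            return False                      # * is associative
        if mul(a, add(b, c)) != add(mul(a, b), mul(a, c)):
            return False                      # * distributes over +
    return True


def is_field(n: int) -> bool:
    """A commutative ring is a field if every nonzero element has an inverse."""
    return is_commutative_ring(n) and all(
        any((a * b) % n == 1 for b in range(n)) for a in range(1, n)
    )


print(is_commutative_ring(6), is_field(5), is_field(6))
```

The check confirms that the integers modulo 6 form a commutative ring but not a field, since for example 2 has no multiplicative inverse modulo 6, whereas the integers modulo 5 form a field.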

6.2.2  Matrices 

A matrix over a set $R$ is a rectangular array of elements drawn from $R$ consisting of some
number $m$ of rows and some number $n$ of columns. Rows are indexed by integers from the set
$\{1, 2, 3, \ldots, m\}$ and columns are indexed by integers from the set $\{1, 2, 3, \ldots, n\}$.
The entry in the $i$th row and $j$th column of $A$ is denoted $a_{i,j}$, as suggested in the
following example:

$$A = \begin{bmatrix} a_{1,1} & a_{1,2} & a_{1,3} & a_{1,4} \\ a_{2,1} & a_{2,2} & a_{2,3} & a_{2,4} \\ a_{3,1} & a_{3,2} & a_{3,3} & a_{3,4} \end{bmatrix} = \begin{bmatrix} 1 & 2 & 3 & 4 \\ 5 & 6 & 7 & 8 \\ 9 & 10 & 11 & 12 \end{bmatrix}$$

Thus, $a_{2,3} = 7$ and $a_{3,1} = 9$.

The transpose of a matrix $A$, denoted $A^T$, is the matrix obtained from $A$ by exchanging
rows and columns, as shown below for the matrix $A$ above:

$$A^T = \begin{bmatrix} 1 & 5 & 9 \\ 2 & 6 & 10 \\ 3 & 7 & 11 \\ 4 & 8 & 12 \end{bmatrix}$$

Clearly, the transpose of the transpose of a matrix $A$, $(A^T)^T$, is the matrix $A$.

A column $n$-vector $x$ is a matrix containing one column and $n$ rows, for example:

$$x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} = \begin{bmatrix} 5 \\ 6 \\ \vdots \\ 8 \end{bmatrix}$$

A row $m$-vector $y$ is a matrix containing one row and $m$ columns, for example:

$$y = [y_1, y_2, \ldots, y_m] = [1, 5, \ldots, 9]$$

The transpose of a row vector is a column vector and vice versa.

A square matrix is an $n \times n$ matrix for some integer $n$. The main diagonal of an
$n \times n$ square matrix $A$ is the set of elements
$\{a_{1,1}, a_{2,2}, \ldots, a_{n-1,n-1}, a_{n,n}\}$. The diagonal below (above) the main diagonal
is the elements $\{a_{2,1}, a_{3,2}, \ldots, a_{n,n-1}\}$ ($\{a_{1,2}, a_{2,3}, \ldots, a_{n-1,n}\}$).
The $n \times n$ identity matrix, $I_n$, is a square $n \times n$ matrix with value 1 on the main
diagonal and 0 elsewhere. The $n \times n$ zero matrix, $0_n$, has value 0 in each position. A
matrix is upper (lower) triangular if all elements below (above) the main diagonal are 0. A
square matrix $A$ is symmetric if $A = A^T$, that is, $a_{i,j} = a_{j,i}$ for all
$1 \leq i, j \leq n$.

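The scalar product of a scalar $c \in R$ and an $n \times m$ matrix $A$ over $R$, denoted $cA$,
has value $c\,a_{i,j}$ in row $i$ and column $j$.

The matrix-vector product between an $m \times n$ matrix $A$ and a column $n$-vector $x$ is the
column $m$-vector $b$ below:

$$b = Ax = \begin{bmatrix} a_{1,1} & a_{1,2} & \cdots & a_{1,n} \\ a_{2,1} & a_{2,2} & \cdots & a_{2,n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m-1,1} & a_{m-1,2} & \cdots & a_{m-1,n} \\ a_{m,1} & a_{m,2} & \cdots & a_{m,n} \end{bmatrix} \times \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_{n-1} \\ x_n \end{bmatrix}$$

$$= \begin{bmatrix} a_{1,1} * x_1 + a_{1,2} * x_2 + \cdots + a_{1,n} * x_n \\ a_{2,1} * x_1 + a_{2,2} * x_2 + \cdots + a_{2,n} * x_n \\ \vdots \\ a_{m-1,1} * x_1 + a_{m-1,2} * x_2 + \cdots + a_{m-1,n} * x_n \\ a_{m,1} * x_1 + a_{m,2} * x_2 + \cdots + a_{m,n} * x_n \end{bmatrix}$$

Thus, $b_i$ is defined as follows for $1 \leq i \leq m$:

$$b_i = a_{i,1} * x_1 + a_{i,2} * x_2 + \cdots + a_{i,n} * x_n$$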

The matrix-vector product between a row $m$-vector $x$ and an $m \times n$ matrix $A$ is the row
$n$-vector $b$ below:

$$b = [b_i] = xA$$

where for $1 \leq i \leq n$ $b_i$ satisfies

$$b_i = x_1 * a_{1,i} + x_2 * a_{2,i} + \cdots + x_m * a_{m,i}$$

The special case of a matrix-vector product between a row $n$-vector, $x$, and a column
$n$-vector, $y$, denoted $x \cdot y$ and defined below, is called the inner product of the two
vectors:

$$x \cdot y = \sum_{i=1}^{n} x_i * y_i$$
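
The transpose, the two matrix-vector products, and the inner product translate directly into code. The sketch below is an illustration using integer entries and Python lists of rows; the function names are hypothetical and no claim is made about efficiency.

```python
def transpose(A):
    """Exchange rows and columns of a matrix given as a list of rows."""
    return [list(col) for col in zip(*A)]


def inner(x, y):
    """Inner product of two vectors of equal length."""
    assert len(x) == len(y)
    return sum(xi * yi for xi, yi in zip(x, y))


def mat_vec(A, x):
    """b = Ax: entry i is the inner product of row i of A with x."""
    return [inner(row, x) for row in A]


def vec_mat(x, A):
    """b = xA: entry i is the inner product of x with column i of A."""
    return [inner(x, col) for col in zip(*A)]


A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]   # the 3 x 4 example matrix
print(transpose(A))
print(mat_vec(A, [1, 0, 0, 1]))
print(vec_mat([1, 1, 1], A))
```

Writing both products in terms of `inner` makes visible that $xA$ is the transpose of $A^T x^T$, which the code satisfies as `vec_mat(x, A) == mat_vec(transpose(A), x)`.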

If the entries of the $n \times n$ matrix $A$ and the column $n$-vectors $x$ and $b$ shown below
are drawn from a ring $\mathcal{R}$ and $A$ and $b$ are given, then the following matrix equation
defines a linear system of $n$ equations in the $n$ unknowns $x$:

$$Ax = b$$

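An example of a linear system of four equations in four unknowns is

\[
\begin{array}{rcrcrcrcr}
1 * x_1 & + & 2 * x_2 & + & 3 * x_3 & + & 4 * x_4 & = & 17 \\
5 * x_1 & + & 6 * x_2 & + & 7 * x_3 & + & 8 * x_4 & = & 18 \\
9 * x_1 & + & 10 * x_2 & + & 11 * x_3 & + & 12 * x_4 & = & 19 \\
13 * x_1 & + & 14 * x_2 & + & 15 * x_3 & + & 16 * x_4 & = & 20
\end{array}
\]

It can be expressed as follows:

\[
\begin{bmatrix}
1 & 2 & 3 & 4 \\
5 & 6 & 7 & 8 \\
9 & 10 & 11 & 12 \\
13 & 14 & 15 & 16
\end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix}
=
\begin{bmatrix} 17 \\ 18 \\ 19 \\ 20 \end{bmatrix}
\]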

Solving  a  linear  system,  when  it  is  possible,  consists  of  finding  values  for  x  given  values  for 
A  and  b.  (See  Section  6.6.) 

Consider the set of m × n matrices whose entries are drawn from a ring $\mathcal{R}$. The matrix
addition function $f^{(m,n)}_{A+B} : \mathcal{R}^{2mn} \mapsto \mathcal{R}^{mn}$ on two m × n matrices $A = [a_{i,j}]$ and $B = [b_{i,j}]$
generates a matrix $C = f^{(m,n)}_{A+B}(A, B) = A +_{m,n} B = [c_{i,j}]$, where $+_{m,n}$ is the infix matrix
addition operator and $c_{i,j}$ is defined as

$c_{i,j} = a_{i,j} + b_{i,j}$

The straight-line program based on this equation uses one instance of the ring addition
operator + for each entry in C. It follows that over the basis {+}, $C_+(f^{(m,n)}_{A+B}) = mn$ and
$D_+(f^{(m,n)}_{A+B}) = 1$. Two special cases of matrix addition are the addition of square matrices
(m = n), denoted $+_n$, and the addition of row or column vectors that are either 1 × n or
m × 1 matrices.

The matrix multiplication function $f^{(n)}_{A \times B} : \mathcal{R}^{(m+p)n} \mapsto \mathcal{R}^{mp}$ multiplies an m × n
matrix $A = [a_{i,j}]$ by an n × p matrix $B = [b_{i,j}]$ to produce the m × p matrix
$C = f^{(n)}_{A \times B}(A, B) = A \times_n B = [c_{i,j}]$, where

\[
c_{i,j} = \sum_{k=1}^{n} a_{i,k} * b_{k,j} \qquad (6.1)
\]

and $\times_n$ is the infix matrix multiplication operator. The subscript on $\times_n$ is usually dropped
when the dimensions of the matrices are understood. The standard matrix multiplication
algorithm for multiplying an m × n matrix A by an n × p matrix B forms mp inner products
of the kind shown in equation (6.1). Thus, it uses mnp instances of the ring multiplication
operator and m(n - 1)p instances of the ring addition operator.
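
The operation counts above can be checked with a short program. The following is a minimal sketch, assuming matrices are stored as Python lists of rows; the function name `standard_matmul` and the counters are illustrative and are not part of the text.

```python
def standard_matmul(A, B):
    """Multiply an m x n matrix A by an n x p matrix B via equation (6.1).

    Returns (C, mults, adds), where mults and adds count the ring
    multiplications and ring additions that were performed.
    """
    m, n, p = len(A), len(B), len(B[0])
    assert all(len(row) == n for row in A), "inner dimensions must agree"
    C = [[None] * p for _ in range(m)]
    mults = adds = 0
    for i in range(m):
        for j in range(p):
            # Inner product of row i of A with column j of B:
            # n multiplications and n - 1 additions.
            acc = A[i][0] * B[0][j]
            mults += 1
            for k in range(1, n):
                acc = acc + A[i][k] * B[k][j]
                mults += 1
                adds += 1
            C[i][j] = acc
    return C, mults, adds
```

For a 2 × 3 matrix times a 3 × 4 matrix the counters come out to mnp = 24 multiplications and m(n - 1)p = 16 additions.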

A fast algorithm for matrix multiplication is given in Section 6.3.1. It is now straightforward
to show the following result. (See Problem 6.4.)

THEOREM 6.2.1 Let $M_{n \times n}$ be the set of n × n matrices over a commutative ring $\mathcal{R}$. The
system $\mathcal{M}_{n \times n} = (M_{n \times n}, +_n, \times_n, 0_n, I_n)$, where $+_n$ and $\times_n$ are the matrix addition and
multiplication operators and $0_n$ and $I_n$ are the n × n zero and identity matrices, is a ring.

The ring of matrices $\mathcal{M}_{n \times n}$ is not a commutative ring because matrix multiplication is not
commutative. For example, the following two matrices do not commute, that is, AB ≠ BA:

\[
A = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}
\qquad
B = \begin{bmatrix} 1 & 0 \\ 0 & -1 \end{bmatrix}
\]

A linear combination of a subset of the rows of an n × m matrix A is a sum of scalar
products of the rows in this subset. A linear combination is non-zero if the sum of the scalar
products is not the zero vector. A set of rows of a matrix A over a field $\mathcal{R}$ is linearly
independent if all linear combinations are non-zero except when each scalar is zero.

The rank of an n × m matrix A over a field $\mathcal{R}$, $f^{(n)}_{rank} : \mathcal{R}^{nm} \mapsto \mathbb{N}$, is the maximum
number of linearly independent rows of A. It is also the maximum number of linearly
independent columns of A. (See Problem 6.5.) We write $rank(A) = f^{(n)}_{rank}(A)$. An n × n matrix
A is non-singular if rank(A) = n.

If an n × n matrix A over a field $\mathcal{R}$ is non-singular, it has an inverse $A^{-1}$ that is an n × n
matrix with the following properties:

$A A^{-1} = A^{-1} A = I_n$

where $I_n$ is the n × n identity matrix. That is, there is a (partial) inverse function
$f^{(n)}_{A^{-1}} : \mathcal{R}^{n^2} \mapsto \mathcal{R}^{n^2}$ that is defined for non-singular square matrices A such that $f^{(n)}_{A^{-1}}(A) = A^{-1}$.
$f^{(n)}_{A^{-1}}$ is partial because it is not defined for singular matrices. Below we exhibit a matrix and its
inverse over a field $\mathcal{R}$.

\[
\begin{bmatrix} 1 & 1 \\ -1 & 1 \end{bmatrix}^{-1}
= \frac{1}{2}
\begin{bmatrix} 1 & -1 \\ 1 & 1 \end{bmatrix}
\]

Algorithms  for  matrix  inversion  are  given  in  Section  6.5. 

We now show that the inverse $(AB)^{-1}$ of the product AB of two invertible matrices, A
and B, over a field $\mathcal{R}$ is the product of their inverses in reverse order.

LEMMA 6.2.1 Let A and B be invertible square matrices over a field $\mathcal{R}$. Then the following
relationship holds:

$(AB)^{-1} = B^{-1} A^{-1}$

Proof To show that $(AB)^{-1} = B^{-1} A^{-1}$, we multiply AB either on the left or right by
$B^{-1} A^{-1}$ to produce the identity matrix:

$AB(AB)^{-1} = A B B^{-1} A^{-1} = A (B B^{-1}) A^{-1} = A A^{-1} = I$

$(AB)^{-1} AB = B^{-1} A^{-1} A B = B^{-1} (A^{-1} A) B = B^{-1} B = I$ ■

The transpose of the product of an m × n matrix A and an n × p matrix B over a ring $\mathcal{R}$
is the product of their transposes in reverse order:

$(AB)^T = B^T A^T$

(See Problem 6.6.) In particular, the following identity holds for an m × n matrix A and a
column n-vector x:

$x^T A^T = (Ax)^T$

A block matrix is a matrix in which each entry is a matrix with fixed dimensions. For
example, when n is even it may be convenient to view an n × n matrix as a 2 × 2 matrix whose
four entries are (n/2) × (n/2) matrices.

Two special types of matrix that are frequently encountered are the Toeplitz and circulant
matrices. An n × n Toeplitz matrix T has the property that its (i, j) entry $t_{i,j} = a_r$ for
j = i - n + 1 + r and 0 ≤ r ≤ 2n - 2. A generic Toeplitz matrix T is shown below:

\[
T = \begin{bmatrix}
a_{n-1} & a_n & a_{n+1} & \cdots & a_{2n-2} \\
a_{n-2} & a_{n-1} & a_n & \cdots & a_{2n-3} \\
a_{n-3} & a_{n-2} & a_{n-1} & \cdots & a_{2n-4} \\
\vdots & \vdots & \vdots & & \vdots \\
a_0 & a_1 & a_2 & \cdots & a_{n-1}
\end{bmatrix}
\]

An n × n circulant matrix C has the property that the entries on the kth row are a right
cyclic shift by k - 1 places of the entries on the first row, as suggested below.

\[
C = \begin{bmatrix}
a_0 & a_1 & a_2 & \cdots & a_{n-1} \\
a_{n-1} & a_0 & a_1 & \cdots & a_{n-2} \\
a_{n-2} & a_{n-1} & a_0 & \cdots & a_{n-3} \\
\vdots & \vdots & \vdots & & \vdots \\
a_1 & a_2 & a_3 & \cdots & a_0
\end{bmatrix}
\]

The  circulant  is  a  type  of  Toeplitz  matrix.  Thus  the  function  defined  by  the  product  of  a 
Toeplitz  matrix  and  a  vector  contains  as  a  subfunction  the  function  defined  by  the  product  of 
a  circulant  matrix  and  a  vector.  Consequently,  any  algorithm  to  multiply  a  vector  by  a  Toeplitz 
matrix  can  be  used  to  multiply  a  circulant  by  a  vector. 
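
The containment of circulants among Toeplitz matrices can be made concrete. The sketch below is illustrative only: it assumes 0-based Python lists, and the helper names `toeplitz`, `circulant`, and `circulant_as_toeplitz` are hypothetical, introduced here for the example.

```python
def toeplitz(a, n):
    """n x n Toeplitz matrix built from a = [a_0, ..., a_{2n-2}].

    Entry (i, j) is a_r with r = j - i + n - 1, which is the rule
    j = i - n + 1 + r solved for r (j - i is the same in 0- and 1-based indexing).
    """
    assert len(a) == 2 * n - 1
    return [[a[j - i + n - 1] for j in range(n)] for i in range(n)]


def circulant(first_row):
    """Circulant matrix whose kth row is the first row shifted right k - 1 places."""
    n = len(first_row)
    return [[first_row[(j - i) % n] for j in range(n)] for i in range(n)]


def circulant_as_toeplitz(first_row):
    """Build the same circulant through the Toeplitz constructor."""
    n = len(first_row)
    # The diagonal with j - i = r - (n - 1) carries first_row[(j - i) mod n].
    a = [first_row[(r - (n - 1)) % n] for r in range(2 * n - 1)]
    return toeplitz(a, n)
```

Because `circulant_as_toeplitz` only chooses the 2n - 1 Toeplitz parameters, any routine that multiplies a Toeplitz matrix by a vector can be reused unchanged for circulants.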

As stated in Section 2.11, a permutation $\pi : \mathcal{R}^n \mapsto \mathcal{R}^n$ of an n-tuple $x = (x_1, x_2, \ldots, x_n)$
over the set $\mathcal{R}$ is a rearrangement $\pi(x) = (x_{\pi(1)}, x_{\pi(2)}, \ldots, x_{\pi(n)})$ of the components
of x. An n × n permutation matrix P has entries from the set {0, 1} (here 0 and 1 are the
identities under addition and multiplication for a ring $\mathcal{R}$) with the property that each row
and column of P has exactly one instance of 1. (See the example below.) Let A be an n × n
matrix. Then AP contains the columns of A in a permuted order determined by P. A similar
statement applies to PA. Shown below are a matrix A, a permutation matrix P, and the product
AP obtained by multiplying P on the left by A. In this case P interchanges the first two columns of A.

\[
\begin{bmatrix}
1 & 2 & 3 & 4 \\
5 & 6 & 7 & 8 \\
9 & 10 & 11 & 12 \\
13 & 14 & 15 & 16
\end{bmatrix}
\times
\begin{bmatrix}
0 & 1 & 0 & 0 \\
1 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1
\end{bmatrix}
=
\begin{bmatrix}
2 & 1 & 3 & 4 \\
6 & 5 & 7 & 8 \\
10 & 9 & 11 & 12 \\
14 & 13 & 15 & 16
\end{bmatrix}
\]
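
The effect of multiplying by a permutation matrix on either side can be checked directly. This is a small sketch using the 4 × 4 example; the helper `matmul` is a plain triple-loop product written for this illustration.

```python
def matmul(X, Y):
    """Product of two matrices given as lists of rows."""
    return [[sum(x * Y[k][j] for k, x in enumerate(row))
             for j in range(len(Y[0]))] for row in X]


A = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12],
     [13, 14, 15, 16]]
P = [[0, 1, 0, 0],
     [1, 0, 0, 0],
     [0, 0, 1, 0],
     [0, 0, 0, 1]]

AP = matmul(A, P)  # the first two columns of A are exchanged
PA = matmul(P, A)  # the first two rows of A are exchanged
```

Multiplying by P on the right permutes columns, while multiplying by the same P on the left permutes rows.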

6.3  Matrix  Multiplication 

Matrix multiplication is defined in Section 6.2. The standard matrix multiplication algorithm
computes the matrix product using the formula for $c_{i,j}$ given in (6.1). To multiply an m × n
matrix by an n × p matrix it performs mnp multiplications and m(n - 1)p additions. As shown
in Section 6.3.1, however, matrices can be multiplied with many fewer operations.

Boolean matrix multiplication is matrix multiplication for matrices over $\mathcal{B}$ when + denotes
OR and * denotes AND. Another example is matrix multiplication over the set of integers
modulo a prime p, a set that forms a finite field under addition and multiplication modulo p.
(See Problem 6.3.)

In the next section we describe Strassen's algorithm, a straight-line program realizable by a
logarithmic-depth circuit of size $O(n^{2.807})$. This is not the final word on matrix multiplication,
however. Winograd and Coppersmith [81] have improved the bound to $O(n^{2.38})$. Despite
this progress, the smallest asymptotic bound on matrix multiplication remains unknown.

Since later in this chapter we design algorithms that make use of matrix multiplication,
it behooves us to make the following definition concerning the number of ring operations to
multiply two n × n matrices over a ring $\mathcal{R}$.

DEFINITION 6.3.1 Let K > 1. Then $M_{matrix}(n, K)$ is the size of the smallest circuit of depth
$K \log_2 n$ over a commutative ring $\mathcal{R}$ for the multiplication of two n × n matrices.

The following assumptions on the rate of growth of $M_{matrix}(n, K)$ with n make subsequent
analysis easier. They are satisfied by Strassen's algorithm.

ASSUMPTION 6.3.1 We assume that for all c satisfying 0 ≤ c ≤ 1 and n ≥ 1,

$M_{matrix}(cn, K) \le c^2 M_{matrix}(n, K)$

ASSUMPTION 6.3.2 We assume there exists an integer $n_0 > 0$ such that, for $n \ge n_0$,

$2n^2 \le M_{matrix}(n, K)$

6.3.1  Strassen’s  Algorithm 

Strassen [319] has developed a fast algorithm for multiplying two square matrices over a
commutative ring $\mathcal{R}$. This algorithm makes use of the additive inverse of ring elements to reduce
the total number of operations performed.

Let n be even. Given two n × n matrices, A and B, we write them and their product C
as 2 × 2 matrices whose components are (n/2) × (n/2) matrices:

\[
C = \begin{bmatrix} u & v \\ w & x \end{bmatrix}
= A \times B =
\begin{bmatrix} a & b \\ c & d \end{bmatrix}
\times
\begin{bmatrix} e & f \\ g & h \end{bmatrix}
\]

Using the standard algorithm, we can form C with eight multiplications and four additions
of (n/2) × (n/2) matrices. Strassen's algorithm uses seven multiplications and 18 additions or
subtractions; that is, it exchanges one of these multiplications for 14 further additions. Since
one multiplication of two (n/2) × (n/2) matrices is much more costly than an addition of two
such matrices, a large reduction in the number of operations is obtained. We now derive
Strassen's algorithm.

Let D be the 4 × 4 matrix shown below whose entries are (n/2) × (n/2) matrices.
(Thus, D is a 2n × 2n matrix.)

\[
D = \begin{bmatrix}
a & b & 0 & 0 \\
c & d & 0 & 0 \\
0 & 0 & a & b \\
0 & 0 & c & d
\end{bmatrix}
\]

The entries u, v, w, and x of the product A × B can also be produced by the following
matrix-vector product:

\[
\begin{bmatrix} u \\ w \\ v \\ x \end{bmatrix}
= D \times
\begin{bmatrix} e \\ g \\ f \\ h \end{bmatrix}
\]

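We now write D as a sum of seven matrices as shown in Fig. 6.2; that is,

$D = A_1 + A_2 + A_3 + A_4 + A_5 + A_6 + A_7$

Let $P_1, P_2, \ldots, P_7$ be the products of the (n/2) × (n/2) matrices

\[
\begin{array}{ll}
P_1 = (a + d) \times (e + h) & P_5 = (a + b) \times h \\
P_2 = (c + d) \times e & P_6 = (-a + c) \times (e + f) \\
P_3 = a \times (f - h) & P_7 = (b - d) \times (g + h) \\
P_4 = d \times (-e + g) &
\end{array}
\]

\[
A_1 = \begin{bmatrix}
a + d & 0 & 0 & a + d \\
0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 \\
a + d & 0 & 0 & a + d
\end{bmatrix}
\qquad
A_2 = \begin{bmatrix}
0 & 0 & 0 & 0 \\
c + d & 0 & 0 & 0 \\
0 & 0 & 0 & 0 \\
-(c + d) & 0 & 0 & 0
\end{bmatrix}
\]

\[
A_3 = \begin{bmatrix}
0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 \\
0 & 0 & a & -a \\
0 & 0 & a & -a
\end{bmatrix}
\qquad
A_4 = \begin{bmatrix}
-d & d & 0 & 0 \\
-d & d & 0 & 0 \\
0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0
\end{bmatrix}
\]

\[
A_5 = \begin{bmatrix}
0 & 0 & 0 & -(a + b) \\
0 & 0 & 0 & 0 \\
0 & 0 & 0 & a + b \\
0 & 0 & 0 & 0
\end{bmatrix}
\qquad
A_6 = \begin{bmatrix}
0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 \\
-a + c & 0 & -a + c & 0
\end{bmatrix}
\]

\[
A_7 = \begin{bmatrix}
0 & b - d & 0 & b - d \\
0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0
\end{bmatrix}
\]

Figure 6.2 The decomposition of the 4 × 4 matrix D as the sum of seven 4 × 4 matrices.

Then the product of the vector $[e, g, f, h]^T$ with D is the following sum of seven column
vectors.

\[
\begin{bmatrix} u \\ w \\ v \\ x \end{bmatrix}
=
\begin{bmatrix} P_1 \\ 0 \\ 0 \\ P_1 \end{bmatrix}
+
\begin{bmatrix} 0 \\ P_2 \\ 0 \\ -P_2 \end{bmatrix}
+
\begin{bmatrix} 0 \\ 0 \\ P_3 \\ P_3 \end{bmatrix}
+
\begin{bmatrix} P_4 \\ P_4 \\ 0 \\ 0 \end{bmatrix}
+
\begin{bmatrix} -P_5 \\ 0 \\ P_5 \\ 0 \end{bmatrix}
+
\begin{bmatrix} 0 \\ 0 \\ 0 \\ P_6 \end{bmatrix}
+
\begin{bmatrix} P_7 \\ 0 \\ 0 \\ 0 \end{bmatrix}
\]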

It follows that u, v, w, and x are given by the following equations:

\[
\begin{array}{ll}
u = P_1 + P_4 - P_5 + P_7 & v = P_3 + P_5 \\
w = P_2 + P_4 & x = P_1 - P_2 + P_3 + P_6
\end{array}
\]

Associativity and commutativity under addition and distributivity of multiplication over
addition are used to obtain this result. In particular, commutativity of the ring multiplication
operator is not assumed. This is important because it allows this algorithm to be used when
the entries in the original 2 × 2 matrices are themselves matrices, since matrix multiplication
is not commutative.
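
The seven products and four output equations translate directly into a recursive program. The following is a minimal sketch, assuming n is a power of 2 and that matrices are lists of rows over the integers; the helper names (`add`, `sub`, `split`, `join`, `strassen`) are illustrative. It recurses down to 1 × 1 blocks for simplicity rather than switching to the standard algorithm on small matrices.

```python
def add(X, Y):
    return [[x + y for x, y in zip(r, s)] for r, s in zip(X, Y)]


def sub(X, Y):
    return [[x - y for x, y in zip(r, s)] for r, s in zip(X, Y)]


def split(M):
    """Return the four (n/2) x (n/2) blocks of M in row-major order."""
    h = len(M) // 2
    return ([row[:h] for row in M[:h]], [row[h:] for row in M[:h]],
            [row[:h] for row in M[h:]], [row[h:] for row in M[h:]])


def join(u, v, w, x):
    """Assemble [[u, v], [w, x]] into a single matrix."""
    return [r + s for r, s in zip(u, v)] + [r + s for r, s in zip(w, x)]


def strassen(A, B):
    """Multiply two n x n matrices, n a power of 2, with 7 block products."""
    if len(A) == 1:
        return [[A[0][0] * B[0][0]]]
    a, b, c, d = split(A)
    e, f, g, h = split(B)
    # Ten block additions or subtractions feed the seven products.
    p1 = strassen(add(a, d), add(e, h))
    p2 = strassen(add(c, d), e)
    p3 = strassen(a, sub(f, h))
    p4 = strassen(d, sub(g, e))
    p5 = strassen(add(a, b), h)
    p6 = strassen(sub(c, a), add(e, f))
    p7 = strassen(sub(b, d), add(g, h))
    # Eight more additions or subtractions combine them: 18 in total.
    u = add(sub(add(p1, p4), p5), p7)
    v = add(p3, p5)
    w = add(p2, p4)
    x = add(add(sub(p1, p2), p3), p6)
    return join(u, v, w, x)
```

Note that every product keeps its left factor from A and its right factor from B, so the code never relies on commutativity of the block multiplication.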

Thus, an algorithm exists to form the product of two n × n matrices with seven
multiplications of (n/2) × (n/2) matrices and 18 additions or subtractions of such matrices. Let
$n = 2^k$ and M(k) be the number of operations over the ring $\mathcal{R}$ used by this algorithm to
multiply n × n matrices. Then, M(k) satisfies

$M(k) = 7M(k - 1) + 18(2^{k-1})^2 = 7M(k - 1) + (18)4^{k-1}$

If the standard algorithm is used to multiply 2 × 2 matrices, M(1) = 12 and the solution of
this recurrence is

$M(k) = (36/7)7^k - (18/3)4^k$
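
The closed form can be checked against the recurrence numerically. The sketch below is an illustration only; the function names are hypothetical, and the closed form is rewritten with integers as $36 \cdot 7^{k-1} - 6 \cdot 4^k$.

```python
def strassen_ops_recurrence(k):
    """M(k) from M(1) = 12 and M(k) = 7 M(k-1) + 18 * 4^(k-1)."""
    ops = 12
    for level in range(2, k + 1):
        ops = 7 * ops + 18 * 4 ** (level - 1)
    return ops


def strassen_ops_closed_form(k):
    """(36/7) 7^k - (18/3) 4^k, written with integer arithmetic only."""
    return 36 * 7 ** (k - 1) - 6 * 4 ** k


def standard_ops(k):
    """n^3 multiplications plus n^2 (n - 1) additions for n = 2^k."""
    n = 2 ** k
    return n ** 3 + n ** 2 * (n - 1)
```

Comparing `strassen_ops_recurrence(k)` with `standard_ops(k)` for increasing k shows where the $7^k$ growth begins to pay off against the $8^k$ growth of the standard algorithm.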

The depth (number of operations on the longest path), D(k), of this straight-line
algorithm for the product of two n × n matrices when $n = 2^k$ satisfies the following bound:

$D(k) = D(k - 1) + 3$

because one level of addition or subtraction is used before products are formed and one or two
levels are used after they are formed. Since D(1) = 2 if the standard algorithm is used to
multiply 2 × 2 matrices, $D(k) = 3k - 1 = 3 \log_2 n - 1$.

These size and depth bounds can be improved to those in the following theorem by using
the standard matrix multiplication algorithm on small matrices. (See Problem 6.8.)

THEOREM 6.3.1 The matrix multiplication function for n × n matrices over a commutative ring
$\mathcal{R}$, $f^{(n)}_{A \times B}$, has circuit size and depth satisfying the following bounds over the basis $\Omega$ containing
addition, multiplication, and additive inverse over $\mathcal{R}$:

$C_\Omega(f^{(n)}_{A \times B}) \le 4.77\, n^{\log_2 7}$
$D_\Omega(f^{(n)}_{A \times B}) = O(\log n)$

We emphasize again that subtraction plays a central role in Strassen's algorithm. Without
it, as we show in Section 10.4, the standard algorithm is nearly best possible.

Strassen's algorithm is practical for sufficiently large matrices, say with n ≥ 64. It can
also be used to multiply Boolean matrices even though the addition operator (OR) and the
multiplication operator (AND) over the set $\mathcal{B}$ do not constitute a ring. (See Problem 6.9.)

6.4  Transitive  Closure 

The edges of a directed graph G = (V, E), n = |V|, specify paths of length 1 between pairs of
vertices. (See Fig. 6.3.) This information is captured by the Boolean n × n adjacency matrix
$A = [a_{i,j}]$, 1 ≤ i, j ≤ n, where $a_{i,j}$ is 1 if there is an edge from vertex i to vertex j in E and
0 otherwise. (The adjacency matrix for the graph in Fig. 6.3 is given after Lemma 6.4.1.) Our
goal is to compute a matrix $A^*$ whose i, j entry $a^*_{i,j}$ has value 1 if there is a path of length
0 or more between vertices i and j and value 0 otherwise. $A^*$ is called the transitive closure
of the matrix A. The transitive closure function $f^{(n)}_{A^*} : \mathcal{B}^{n^2} \mapsto \mathcal{B}^{n^2}$ maps an arbitrary n × n
Boolean matrix A onto its n × n transitive closure matrix; that is, $f^{(n)}_{A^*}(A) = A^*$. In this
section we add and multiply Boolean matrices over the set $\mathcal{B}$ using OR as the element addition
operation and AND as the element multiplication operation. (Note that $(\mathcal{B}, \vee, \wedge, 0, 1)$ is not
a ring; it satisfies all the rules for a ring except for the condition that each element of $\mathcal{B}$ have
an (additive) inverse under $\vee$.)

To compute $A^*$ we use the following facts: a) the entry in the rth row and sth column
of the Boolean matrix product $A^2 = A \times A$ is 1 if there is a path containing two edges from
vertex r to vertex s and 0 otherwise (which follows from the definition of Boolean matrix
multiplication given in Section 6.3), and b) the entry in the rth row and sth column of
$A^k = A^{k-1} \times A$ is 1 if there is a path containing k edges from vertex r to vertex s and 0
otherwise, as the reader is asked to show. (See Problem 6.11.)

LEMMA 6.4.1 Let A be the Boolean adjacency matrix for a directed graph and let $A^k$ be the kth
power of A. Then the following identity holds for k ≥ 1, where + denotes the addition (OR) of
Boolean matrices:

\[
(I + A)^k = I + A + \cdots + A^k \qquad (6.2)
\]

Proof The proof is by induction. The base case is k = 1, for which the identity holds.
Assume that it holds for k ≤ K - 1. We show that it holds for k = K. Since
$(I + A)^{K-1} = I + A + \cdots + A^{K-1}$, multiply both sides by I + A:

\[
\begin{aligned}
(I + A)^K &= (I + A) \times (I + A)^{K-1} \\
&= (I + A) \times (I + A + \cdots + A^{K-1}) \\
&= I + (A + A) + \cdots + (A^{K-1} + A^{K-1}) + A^K
\end{aligned}
\]

However, since $A^j$ is a Boolean matrix, $A^j + A^j = A^j$ for all j and the result follows. ■

Figure 6.3 A graph that illustrates transitive closure.

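The adjacency matrix A of the graph in Fig. 6.3 is given below along with its powers up to
the fifth power. Note that every non-zero entry appearing in $A^5$ appears in at least one of the
other matrices. The reason for this fact is explained in the proof of Lemma 6.4.2.

\[
A = \begin{bmatrix}
0 & 0 & 1 & 0 & 1 \\
0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 \\
0 & 1 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 0
\end{bmatrix}
\quad
A^2 = \begin{bmatrix}
1 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 & 0 \\
0 & 1 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 & 1
\end{bmatrix}
\quad
A^3 = \begin{bmatrix}
0 & 1 & 1 & 0 & 1 \\
0 & 1 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 \\
1 & 0 & 0 & 1 & 0
\end{bmatrix}
\]

\[
A^4 = \begin{bmatrix}
1 & 0 & 1 & 1 & 0 \\
0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 \\
0 & 1 & 0 & 0 & 0 \\
0 & 1 & 1 & 0 & 1
\end{bmatrix}
\quad
A^5 = \begin{bmatrix}
0 & 1 & 1 & 1 & 1 \\
0 & 0 & 0 & 1 & 0 \\
0 & 1 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 \\
1 & 0 & 1 & 1 & 0
\end{bmatrix}
\]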

LEMMA 6.4.2 If there is a path between pairs of vertices in the directed graph G = (V, E),
n = |V|, there is a path of length at most n - 1.

Proof We suppose that the shortest path between vertices i and j in V has length k ≥ n.
Such a path has k + 1 vertices. Because k + 1 ≥ n + 1, some vertex appears more than
once. (This is an example of the pigeonhole principle.) Consider the subpath defined by the
edges between the first and last instance of this repeated vertex. Since it constitutes a loop,
it can be removed to produce a shorter path between vertices i and j. This contradicts the
hypothesis that the shortest path has length n or more. Thus, the shortest path has length
at most n - 1. ■

Because the shortest path has length at most n - 1, any non-zero entries in $A^k$, k ≥ n, are
also found in one of the matrices $A^j$, j ≤ n - 1. Since the identity matrix I is the adjacency
matrix for the graph that has paths of length zero between two vertices, the transitive closure,
which includes such paths, is equal to:

$A^* = I + A + A^2 + A^3 + \cdots + A^{n-1} = (I + A)^{n-1}$

It also follows that $A^* = (I + A)^k$ for all k ≥ n - 1, which leads to the following result.

THEOREM 6.4.1 Over the basis $\Omega$ = {AND, OR} the transitive closure function, $f^{(n)}_{A^*}$, has circuit
size and depth satisfying the following bounds (that is, a circuit of this size and depth can be
constructed with AND and OR gates for it):

$C_\Omega(f^{(n)}_{A^*}) \le M_{matrix}(cn, K) \lceil \log_2 n \rceil$

$D_\Omega(f^{(n)}_{A^*}) \le K (\log n) \lceil \log_2 n \rceil$

Proof Let $k = 2^p$ be the smallest power of 2 such that k ≥ n - 1. Then, $p = \lceil \log_2 (n - 1) \rceil$.
Since $A^* = (I + A)^k$, it can be computed with a circuit that squares the matrix I + A p
times. Each squaring can be done with a circuit for the standard matrix multiplication
algorithm described in (6.1) using $M_{matrix}(cn, K) = O(n^3)$ operations and depth $\lceil \log_2 2n \rceil$.
The desired result follows. ■
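
The repeated-squaring construction in this proof can be sketched as follows. This is an illustrative program, not the circuit itself: it assumes Boolean matrices stored as lists of 0/1 rows, and the names `bool_matmul` and `transitive_closure` are introduced here for the example.

```python
def bool_matmul(X, Y):
    """Boolean matrix product: OR plays the role of +, AND the role of *."""
    return [[int(any(x and Y[k][j] for k, x in enumerate(row)))
             for j in range(len(Y[0]))] for row in X]


def transitive_closure(A):
    """Compute A* = (I + A)^k for the first power of two k >= n - 1."""
    n = len(A)
    M = [[1 if i == j else A[i][j] for j in range(n)] for i in range(n)]  # I + A
    k = 1
    while k < n - 1:
        M = bool_matmul(M, M)  # (I + A)^k becomes (I + A)^(2k)
        k *= 2
    return M


# Adjacency matrix of the five-vertex example graph of Fig. 6.3.
A = [[0, 0, 1, 0, 1],
     [0, 0, 1, 0, 0],
     [0, 0, 0, 1, 0],
     [0, 1, 0, 0, 0],
     [1, 0, 0, 0, 0]]
A_star = transitive_closure(A)
```

For this graph n - 1 = 4, so two squarings suffice; vertices 1 and 5 reach every vertex, while vertices 2, 3, and 4 reach only one another.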

The above statement says that the transitive closure function on n × n matrices has circuit
size and depth at most a factor O(log n) times that of matrix multiplication. We now show
that Boolean matrix multiplication is a subfunction of the transitive closure function, which
implies that the former has a circuit size and depth no larger than the latter. We subsequently
show that the size bound can be improved to a constant multiple of the size bound for matrix
multiplication. Thus the transitive closure and Boolean matrix multiplication functions have
comparable size.

THEOREM 6.4.2 The n × n matrix multiplication function $f^{(n)}_{A \times B} : \mathcal{B}^{2n^2} \mapsto \mathcal{B}^{n^2}$ for Boolean
matrices is a subfunction of the transitive closure function $f^{(3n)}_{A^*} : \mathcal{B}^{9n^2} \mapsto \mathcal{B}^{9n^2}$.

Proof Observe that the following relationship holds for n × n matrices A and B, since the
third and higher powers of the 3n × 3n matrix on the left are 0.

\[
\begin{bmatrix}
0 & A & 0 \\
0 & 0 & B \\
0 & 0 & 0
\end{bmatrix}^{*}
=
\begin{bmatrix}
I & A & AB \\
0 & I & B \\
0 & 0 & I
\end{bmatrix}
\]

It follows that the product AB of n × n matrices is a subfunction of the transitive closure
function on a 3n × 3n matrix. ■
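
The embedding used in this proof can be exercised on small inputs. The sketch below is illustrative: it uses Warshall's algorithm as a stand-in closure routine (an assumption made for brevity, not the circuit discussed in the text), and the function names are hypothetical.

```python
def bool_matmul(X, Y):
    """Boolean matrix product with OR as addition and AND as multiplication."""
    return [[int(any(x and Y[k][j] for k, x in enumerate(row)))
             for j in range(len(Y[0]))] for row in X]


def warshall_closure(M):
    """Reflexive-transitive closure of a Boolean matrix (stand-in routine)."""
    n = len(M)
    R = [[1 if i == j else M[i][j] for j in range(n)] for i in range(n)]
    for k in range(n):
        for i in range(n):
            if R[i][k]:
                for j in range(n):
                    if R[k][j]:
                        R[i][j] = 1
    return R


def product_via_closure(A, B):
    """Read AB off the top-right block of the closure of [[0,A,0],[0,0,B],[0,0,0]]."""
    n = len(A)
    N = ([[0] * n + A[i] + [0] * n for i in range(n)] +
         [[0] * (2 * n) + B[i] for i in range(n)] +
         [[0] * (3 * n) for _ in range(n)])
    closure = warshall_closure(N)
    return [row[2 * n:] for row in closure[:n]]
```

Any transitive closure routine could replace `warshall_closure` here, which is exactly what the subfunction relationship expresses.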

COROLLARY 6.4.1 It follows that

$C_\Omega(f^{(n)}_{A \times B}) \le C_\Omega(f^{(3n)}_{A^*})$

$D_\Omega(f^{(n)}_{A \times B}) \le D_\Omega(f^{(3n)}_{A^*})$

over the basis $\Omega$ = {AND, OR}.

Not only can a Boolean matrix multiplication algorithm be devised from one for transitive
closure, but the reverse is also true, as we show. Let n be a power of 2 and divide an n × n
matrix A into four (n/2) × (n/2) matrices:

\[
A = \begin{bmatrix} U & V \\ W & X \end{bmatrix} \qquad (6.3)
\]

Compute $X^*$ recursively and use it to form $Y = U + V X^* W$ by performing two
multiplications of (n/2) × (n/2) matrices and one addition of such matrices. Recursively form $Y^*$ and
then assemble the matrix B shown below with four further multiplications and one addition
of (n/2) × (n/2) matrices.

\[
B = \begin{bmatrix}
Y^* & Y^* V X^* \\
X^* W Y^* & X^* + X^* W Y^* V X^*
\end{bmatrix} \qquad (6.4)
\]

We now show that $B = A^*$.

THEOREM 6.4.3 Under Assumptions 6.3.1 and 6.3.2, a circuit of size $O(M_{matrix}(n, K))$ and
depth O(n) exists to form the transitive closure of n × n matrices.

Proof We assume that n is a power of 2 and use the representation for the matrix A given
in (6.3). If n is not a power of 2, we augment the matrix A by embedding it in a larger
matrix in which all the new entries are 0 except for the new diagonal entries, which are 1.
Given that 4M(n) ≤ M(2n), the bound applies.

We begin by showing that $B = A^*$. Let F ⊂ V and S ⊂ V be the first and second
sets of n/2 vertices, respectively, corresponding to the first and second halves of the rows
and columns of the matrix A. Then, F ∪ S = V and F ∩ S = ∅. Observe that $X^*$ is
the adjacency matrix for those paths originating on and terminating with vertices in S and
visiting no other vertices. Similarly, $Y = U + V X^* W$ is the adjacency matrix for those
paths consisting of an edge from a vertex in F to a vertex in F or paths of length more
than 1 consisting of an edge from vertices in F to vertices in S, a path of length 0 or more
within vertices in S, and an edge from vertices in S to vertices in F. It follows that $Y^*$ is
the adjacency matrix for all paths between vertices in F that may visit any vertices in V. A
similar line of reasoning demonstrates that the other entries of $A^*$ are correct.

The size of a circuit realizing this algorithm, T(n), satisfies

$T(n) = 2T(n/2) + 6M_{matrix}(n/2, K) + 2(n/2)^2$

because the above algorithm (see Fig. 6.4) uses two circuits for transitive closure on (n/2) ×
(n/2) matrices, six circuits for multiplying, and two for adding two such matrices.

Because we assume that $n^2 \le M_{matrix}(n, K)$, it follows that $T(n) \le 2T(n/2) +
8M_{matrix}(n/2, K)$. Let $T(m) \le cM_{matrix}(m, K)$ for m ≤ n/2 be the inductive
hypothesis. Then we have the inequalities

$T(n) \le (2c + 8)M_{matrix}(n/2, K) \le (c/2 + 2)M_{matrix}(n, K)$

which follow from $M_{matrix}(n/2, K) \le M_{matrix}(n, K)/4$ (see Assumption 6.3.1 with c = 1/2).
Because (c/2 + 2) ≤ c for c ≥ 4, for c = 4 we have the desired bound on circuit size.

The depth D(n) of the above circuit satisfies $D(n) = 2D(n/2) + 6K \log_2 n$, from
which we conclude that D(n) = O(n). ■
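
The recursive construction of equations (6.3) and (6.4) can be sketched in code. This is a minimal illustration, assuming Boolean matrices as lists of 0/1 rows; the names `b_add`, `b_mul`, `closure_pow2`, and `closure` are hypothetical, and the padding follows the embedding described at the start of the proof.

```python
def b_add(X, Y):
    """Entrywise OR of two Boolean matrices."""
    return [[x | y for x, y in zip(r, s)] for r, s in zip(X, Y)]


def b_mul(X, Y):
    """Boolean matrix product (OR of ANDs)."""
    return [[int(any(x and Y[k][j] for k, x in enumerate(row)))
             for j in range(len(Y[0]))] for row in X]


def closure_pow2(A):
    """Transitive closure of an n x n Boolean matrix, n a power of 2."""
    n = len(A)
    if n == 1:
        return [[1]]  # a path of length 0 always exists
    h = n // 2
    U = [r[:h] for r in A[:h]]
    V = [r[h:] for r in A[:h]]
    W = [r[:h] for r in A[h:]]
    X = [r[h:] for r in A[h:]]
    Xs = closure_pow2(X)
    VXs = b_mul(V, Xs)
    Y = b_add(U, b_mul(VXs, W))          # Y = U + V X* W (two products)
    Ys = closure_pow2(Y)
    top_right = b_mul(Ys, VXs)           # Y* V X*
    bottom_left = b_mul(b_mul(Xs, W), Ys)            # X* W Y*
    bottom_right = b_add(Xs, b_mul(bottom_left, VXs))  # X* + X* W Y* V X*
    return ([r + s for r, s in zip(Ys, top_right)] +
            [r + s for r, s in zip(bottom_left, bottom_right)])


def closure(A):
    """Pad A to a power-of-2 size (new diagonal entries 1), then recurse."""
    n = len(A)
    size = 1
    while size < n:
        size *= 2
    padded = [[A[i][j] if i < n and j < n else int(i == j)
               for j in range(size)] for i in range(size)]
    return [row[:n] for row in closure_pow2(padded)[:n]]
```

Each call performs two recursive closures, six block products, and two block additions, matching the terms of the recurrence for T(n).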

Figure 6.4 A circuit for the transitive closure of a Boolean matrix based on the construction of
equation (6.4).

A semiring (S, +, ·, 0, 1) is a set S, two operations + and ·, and elements 0, 1 ∈ S with
the following properties:

a) S is closed under + and ·;

b) + and · are associative;

c) for all a ∈ S, a + 0 = 0 + a = a;

d) for all a ∈ S, a · 1 = 1 · a = a;

e) + is commutative and idempotent; i.e., a + a = a;

f) · distributes over +; i.e., for all a, b, c ∈ S, a · (b + c) = a · b + a · c
and (b + c) · a = b · a + c · a.

The above definitions and results generalize to matrices over semirings. To show this, it
suffices to observe that the properties used to derive these results are just these properties. (See
Problem 6.12.)


6.5  Matrix  Inversion 

The inverse of a non-singular n × n matrix M defined over a field $\mathcal{R}$ is another matrix $M^{-1}$
whose product with M is the n × n identity matrix I; that is,

$M M^{-1} = M^{-1} M = I$

Given a linear system of n equations in the column vector x of n unknowns defined by
the non-singular n × n coefficient matrix M and the vector b, namely,

\[
Mx = b \qquad (6.5)
\]

the solution x can be obtained through a matrix-vector multiplication with $M^{-1}$:

$x = M^{-1} b$

In this section we present two algorithms for matrix inversion. Such algorithms compute
the (partial) matrix inverse function $f^{(n)}_{A^{-1}} : \mathcal{R}^{n^2} \mapsto \mathcal{R}^{n^2}$ that maps non-singular n × n
matrices over a field $\mathcal{R}$ onto their inverses. The first result, Theorem 6.5.4, demonstrates that
$C_\Omega(f^{(n)}_{A^{-1}}) = \Theta(M_{matrix}(n, K))$ with a circuit whose depth is more than linear in n. The
second, Theorem 6.5.6, demonstrates that $D_\Omega(f^{(n)}_{A^{-1}}) = O(\log^2 n)$ with a circuit whose size
is $O(n M_{matrix}(n, K))$.

Before describing the two matrix inversion algorithms, we present a result demonstrating
that matrix multiplication of n × n matrices is no harder than inverting a 3n × 3n matrix; the
function defining the former task is a subfunction of the function defining the latter task.

LEMMA 6.5.1 The matrix inverse function $f^{(3n)}_{A^{-1}}$ contains as a subfunction the function
$f^{(n)}_{A \times B} : \mathcal{R}^{2n^2} \mapsto \mathcal{R}^{n^2}$ that maps two matrices over $\mathcal{R}$ to their product.

Proof The proof follows by writing a 3n × 3n matrix as a 3 × 3 matrix of n × n matrices
and then specializing the entries to be the identity matrix I, the zero matrix 0, or matrices
A and B:

\[
\begin{bmatrix}
I & A & 0 \\
0 & I & B \\
0 & 0 & I
\end{bmatrix}^{-1}
=
\begin{bmatrix}
I & -A & AB \\
0 & I & -B \\
0 & 0 & I
\end{bmatrix}
\]

This identity is established by showing that the product of these two matrices is the identity
matrix. ■
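
The identity in this proof is easy to confirm numerically. The following sketch assumes integer matrices stored as lists of rows; the helpers `matmul`, `identity`, `zeros`, `neg`, and `assemble` are hypothetical names introduced for the check.

```python
def matmul(X, Y):
    return [[sum(x * Y[k][j] for k, x in enumerate(row))
             for j in range(len(Y[0]))] for row in X]


def identity(n):
    return [[int(i == j) for j in range(n)] for i in range(n)]


def zeros(n):
    return [[0] * n for _ in range(n)]


def neg(X):
    return [[-x for x in row] for row in X]


def assemble(blocks):
    """Flatten a grid of equally sized square blocks into one matrix."""
    return [sum((blk[r] for blk in block_row), [])
            for block_row in blocks
            for r in range(len(block_row[0]))]


def embedding_and_inverse(A, B):
    """Return the 3n x 3n matrix of the lemma and its claimed inverse."""
    n = len(A)
    I, Z = identity(n), zeros(n)
    L = assemble([[I, A, Z], [Z, I, B], [Z, Z, I]])
    L_inv = assemble([[I, neg(A), matmul(A, B)], [Z, I, neg(B)], [Z, Z, I]])
    return L, L_inv
```

Multiplying the two returned matrices in either order should give the 3n × 3n identity, and the top-right n × n block of the second matrix is the product AB.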

6.5.1  Symmetric  Positive  Definite  Matrices 

Our first algorithm to invert a non-singular n × n matrix M has a circuit size linear in
$M_{matrix}(n, K)$, which, in light of Lemma 6.5.1, is optimal to within a constant multiplicative
factor. This algorithm makes use of symmetric positive definite matrices, the Schur
complement, and $LDL^T$ factorization, terms defined below. This algorithm has depth $O(n \log^2 n)$.

The second algorithm, Csanky's algorithm, has circuit depth $O(\log^2 n)$, which is smaller,
but circuit size $O(n M_{matrix}(n, K))$, which is larger. Symmetric positive definite matrices are
defined below.

DEFINITION  6.5.1  ^  matrix  M  is  positive  definite  if  for  all  non-zero  vectors  x  the  following 
condition  holds: 

xT Mx  =  ^  Xirrii'jXj  >  0 

A  matrix  is  symmetric  positive  definite  (SPD)  if  it  is  both  symmetric  and  positive  definite. 


(6.6) 
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We  now  show  that  an  algorithm  to  invert  SPD  matrices  can  be  used  to  invert  arbitrary 
non-singular  matrices  by  adding  a  circuit  to  multiply  matrices. 

LEMMA  6.5.2  IfM  is  a  non-singular  n  x  n  matrix,  then  the  matrix  P  =  MT  M  is  symmetric 
positive  definite.  M  can  he  inverted  by  inverting  P  and  then  multiplying  P-1  by  MT .  Let 
/spd -inverse  •  72™  p- >  72™  be  the  inverse  function  for  n  x  n  SPD  matrices  over  the  field  72. 
Then  the  size  and  depth  of  over  1Z  satisfy  the  following  bounds: 

C  (/a-i)  —  ^  (/SPD -inverse)  +  -^matrix (n,  K) 

D(f^)  <  D  (/spd -inverse)  +  O(logn) 

Proof To show that $P$ is symmetric we note that $(M^T M)^T = M^T M$. To show that it is positive definite, we observe that

$$x^T P x = x^T M^T M x = (Mx)^T Mx = \sum_{i=1}^{n} \Big( \sum_{j=1}^{n} m_{i,j} x_j \Big)^2$$

which is positive unless the product $Mx$ is identically zero for the non-zero vector $x$. But this cannot be true if $M$ is non-singular. Thus, $P$ is symmetric and positive definite.

To invert $M$, invert $P$ to produce $M^{-1}(M^T)^{-1}$. If we multiply this product on the right by $M^T$, the result is the inverse $M^{-1}$. ■
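The reduction in Lemma 6.5.2 can be illustrated with a small sketch. The helper names (`transpose`, `matmul`, `invert`, `invert_via_spd`) are hypothetical, and the Gauss-Jordan routine is only a stand-in for whatever SPD inverter is available; it is not the circuit construction of this section. Exact rationals are used so that the check is not disturbed by rounding.

```python
from fractions import Fraction

def transpose(a):
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def invert(a):
    """Gauss-Jordan inversion over the rationals (assumes a is non-singular)."""
    n = len(a)
    aug = [[Fraction(v) for v in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def invert_via_spd(m):
    mt = transpose(m)
    p = matmul(mt, m)               # P = M^T M is SPD when M is non-singular
    return matmul(invert(p), mt)    # M^{-1} = P^{-1} M^T

M = [[0, 2], [1, 1]]                # non-singular but not symmetric
M_inv = invert_via_spd(M)
```

Multiplying `M` by `M_inv` should give the identity matrix, which is a quick way to confirm that the extra work beyond the SPD inversion is just the two matrix products.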


6.5.2  Schur  Factorization 

We now describe Schur factorization. Represent an $n \times n$ matrix $M$ as the $2 \times 2$ matrix

$$M = \begin{bmatrix} M_{1,1} & M_{1,2} \\ M_{2,1} & M_{2,2} \end{bmatrix} \qquad (6.7)$$

where $M_{1,1}$, $M_{1,2}$, $M_{2,1}$, and $M_{2,2}$ are $k \times k$, $k \times (n-k)$, $(n-k) \times k$, and $(n-k) \times (n-k)$ matrices, $1 \le k \le n-1$. Let $M_{1,1}$ be invertible. Then by straightforward algebraic manipulation $M$ can be factored as


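$$M = \begin{bmatrix} I & O \\ M_{2,1}M_{1,1}^{-1} & I \end{bmatrix} \begin{bmatrix} M_{1,1} & O \\ O & S \end{bmatrix} \begin{bmatrix} I & M_{1,1}^{-1}M_{1,2} \\ O & I \end{bmatrix} \qquad (6.8)$$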


Here $I$ and $O$ denote identity and zero matrices (all entries are zero) of a size that conforms to the size of other submatrices of those matrices in which they are found. This is the Schur factorization. Also,

$$S = M_{2,2} - M_{2,1} M_{1,1}^{-1} M_{1,2}$$

is the Schur complement of $M$. To show that $M$ has this factorization, it suffices to carry out the product of the above three matrices.
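As a small worked instance (the matrix is chosen here purely for illustration), take $n = 2$, $k = 1$ and

$$M = \begin{bmatrix} 4 & 2 \\ 2 & 3 \end{bmatrix}, \qquad S = 3 - 2 \cdot \tfrac{1}{4} \cdot 2 = 2$$

Carrying out the product of the three factors recovers $M$:

$$\begin{bmatrix} 1 & 0 \\ \tfrac{1}{2} & 1 \end{bmatrix} \begin{bmatrix} 4 & 0 \\ 0 & 2 \end{bmatrix} \begin{bmatrix} 1 & \tfrac{1}{2} \\ 0 & 1 \end{bmatrix} = \begin{bmatrix} 4 & 0 \\ 2 & 2 \end{bmatrix} \begin{bmatrix} 1 & \tfrac{1}{2} \\ 0 & 1 \end{bmatrix} = \begin{bmatrix} 4 & 2 \\ 2 & 3 \end{bmatrix}$$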
The first and last matrix in this product are invertible. If $S$ is also invertible, the middle matrix is invertible, as is the matrix $M$ itself. The inverse of $M$, $M^{-1}$, is given by the product

$$M^{-1} = \begin{bmatrix} I & -M_{1,1}^{-1}M_{1,2} \\ O & I \end{bmatrix} \begin{bmatrix} M_{1,1}^{-1} & O \\ O & S^{-1} \end{bmatrix} \begin{bmatrix} I & O \\ -M_{2,1}M_{1,1}^{-1} & I \end{bmatrix} \qquad (6.9)$$


This follows from three observations: a) the inverse of a product is the product of the inverses in reverse order (see Lemma 6.2.1), b) the inverse of a $2 \times 2$ upper (lower) triangular matrix is the matrix with the off-diagonal term negated, and c) the inverse of a $2 \times 2$ diagonal matrix is a diagonal matrix in which the $i$th diagonal element is the multiplicative inverse of the $i$th diagonal element of the original matrix. (See Problem 6.13 for the latter two results.)

The  following  fact  is  useful  in  inverting  SPD  matrices. 


LEMMA 6.5.3 If $M$ is an $n \times n$ SPD matrix, its Schur complement is also SPD.

Proof Represent $M$ as shown in (6.7). In (6.6) let $x = u \cdot v$; that is, let $x$ be the concatenation of the two column vectors. Then

$$x^T M x = [u^T, v^T] \begin{bmatrix} M_{1,1}u + M_{1,2}v \\ M_{2,1}u + M_{2,2}v \end{bmatrix} = u^T M_{1,1} u + u^T M_{1,2} v + v^T M_{2,1} u + v^T M_{2,2} v$$

If we say that

$$u = -M_{1,1}^{-1} M_{1,2} v$$

and use the fact that $M_{1,2}^T = M_{2,1}$ and $(M_{1,1}^{-1})^T = (M_{1,1}^T)^{-1} = M_{1,1}^{-1}$, it is straightforward to show that $S$ is symmetric and

$$x^T M x = v^T S v$$

where $S$ is the Schur complement of $M$. Thus, if $M$ is SPD, so is its Schur complement. ■

6.5.3  Inversion  of  Triangular  Matrices 

Let $T$ be $n \times n$ lower triangular and non-singular. Without loss of generality, assume that $n = 2^r$. ($T$ can be extended to a $2^r \times 2^r$ matrix by placing it on the diagonal of a $2^r \times 2^r$ matrix along with a $(2^r - n) \times (2^r - n)$ identity matrix.) Represent $T$ as a $2 \times 2$ matrix of $n/2 \times n/2$ matrices:

$$T = \begin{bmatrix} T_{1,1} & O \\ T_{2,1} & T_{2,2} \end{bmatrix}$$

The inverse of $T$, which is lower triangular, is given below, as can be verified directly:

$$T^{-1} = \begin{bmatrix} T_{1,1}^{-1} & O \\ -T_{2,2}^{-1} T_{2,1} T_{1,1}^{-1} & T_{2,2}^{-1} \end{bmatrix}$$
Figure 6.5 A recursive circuit TRI_INV[n] for the inversion of a triangular matrix.


This representation for the inverse of $T$ defines the recursive algorithm TRI_INV[n] in Fig. 6.5. When $n = 1$ this algorithm requires one operation; on an $n \times n$ matrix it requires two calls to TRI_INV[n/2] and two matrix multiplications. Let $f^{(n)}_{\text{tri\_inv}} : \mathcal{R}^{(n^2+n)/2} \mapsto \mathcal{R}^{(n^2+n)/2}$ be the function corresponding to the inversion of an $n \times n$ lower triangular matrix. The algorithm TRI_INV[n] provides the following bounds on the size and depth of the smallest circuit to compute $f^{(n)}_{\text{tri\_inv}}$.

THEOREM 6.5.1 Let $n$ be a power of 2. Then the matrix inversion function $f^{(n)}_{\text{tri\_inv}}$ for $n \times n$ lower triangular matrices satisfies the following bounds:

$$C\big(f^{(n)}_{\text{tri\_inv}}\big) \le M_{\text{matrix}}(n, K)$$

$$D\big(f^{(n)}_{\text{tri\_inv}}\big) = O(\log^2 n)$$

Proof From Fig. 6.5 it is clear that the following circuit size and depth bounds hold if the matrix multiplication algorithm has circuit size $M_{\text{matrix}}(n, K)$ and depth $K \log_2 n$:

$$C\big(f^{(n)}_{\text{tri\_inv}}\big) \le 2C\big(f^{(n/2)}_{\text{tri\_inv}}\big) + 2M_{\text{matrix}}(n/2, K)$$

$$D\big(f^{(n)}_{\text{tri\_inv}}\big) \le D\big(f^{(n/2)}_{\text{tri\_inv}}\big) + 2K \log n$$

The solution to the first inequality follows by induction from the fact that $M_{\text{matrix}}(1, K) = 1$ and the assumption that $4M_{\text{matrix}}(n/2, K) \le M_{\text{matrix}}(n, K)$. The second inequality follows from the observation that $d > 0$ can be chosen so that $d \log^2(n/2) + c \log n \le d \log^2 n$ for any $c > 0$ for $n$ sufficiently large. ■
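The recursion behind TRI_INV[n] can be sketched sequentially as follows. This is an illustration of the block formula for $T^{-1}$ rather than a circuit; the names `matmul` and `tri_inv` are hypothetical, the order of the matrix is assumed to be a power of 2, and exact rationals stand in for field elements.

```python
from fractions import Fraction

def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def tri_inv(t):
    """Invert a non-singular lower triangular matrix whose order is a power of 2."""
    n = len(t)
    if n == 1:
        return [[1 / Fraction(t[0][0])]]          # one operation in the base case
    h = n // 2
    t11 = [row[:h] for row in t[:h]]
    t21 = [row[:h] for row in t[h:]]
    t22 = [row[h:] for row in t[h:]]
    inv11 = tri_inv(t11)                          # first recursive call
    inv22 = tri_inv(t22)                          # second recursive call
    # lower-left block -T22^{-1} T21 T11^{-1}: two matrix multiplications
    low = [[-v for v in row] for row in matmul(matmul(inv22, t21), inv11)]
    top = [row + [Fraction(0)] * h for row in inv11]
    bottom = [left + right for left, right in zip(low, inv22)]
    return top + bottom

T = [[2, 0, 0, 0],
     [1, 1, 0, 0],
     [4, 3, 2, 0],
     [1, 0, 5, 1]]
T_inv = tri_inv(T)
```

The two recursive calls are independent of each other, which is what allows the circuit to place them side by side and pay only for the two multiplications in depth at each level.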
6.5.4 $LDL^T$ Factorization of SPD Matrices

Now that we know that the Schur complement $S$ of $M$ is SPD when $M$ is SPD, we can show that every SPD matrix $M$ has a factorization as the product $LDL^T$ of a unit lower triangular matrix $L$ (each of its diagonal entries is the multiplicative unit of the field $\mathcal{R}$), a diagonal matrix $D$, and the transpose of $L$.

THEOREM 6.5.2 Every $n \times n$ SPD matrix $M$ has a factorization as the product $M = LDL^T$, where $L$ is a unit lower triangular matrix and $D$ is a diagonal matrix.

Proof The proof is by induction on $n$. For $n = 1$ the result is obvious because we can write $[m_{1,1}] = [1][m_{1,1}][1]$. Assume that it holds for $n \le N - 1$. We show that it holds for $n = N$.

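Form the Schur factorization of the $N \times N$ matrix $M$. Since the $k \times k$ submatrix $M_{1,1}$ of $M$ as well as the $(n-k) \times (n-k)$ Schur complement $S$ of $M$ are SPD, by the inductive hypothesis they can be factored in the same fashion. Let

$$M_{1,1} = L_1 D_1 L_1^T, \qquad S = L_2 D_2 L_2^T$$

Then the middle matrix on the right-hand side of equation (6.8) can be represented as

$$\begin{bmatrix} M_{1,1} & O \\ O & S \end{bmatrix} = \begin{bmatrix} L_1 & O \\ O & L_2 \end{bmatrix} \begin{bmatrix} D_1 & O \\ O & D_2 \end{bmatrix} \begin{bmatrix} L_1^T & O \\ O & L_2^T \end{bmatrix}$$
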

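Substituting the above product for the middle matrix in (6.8) and multiplying the two left and two right matrices gives the following representation for $M$:

$$M = \begin{bmatrix} L_1 & O \\ M_{2,1}M_{1,1}^{-1}L_1 & L_2 \end{bmatrix} \begin{bmatrix} D_1 & O \\ O & D_2 \end{bmatrix} \begin{bmatrix} L_1^T & L_1^T M_{1,1}^{-1} M_{1,2} \\ O & L_2^T \end{bmatrix}$$
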

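Since $M$ is symmetric, $M_{1,1}$ is symmetric, $M_{1,2} = M_{2,1}^T$, and

$$L_1^T M_{1,1}^{-1} M_{1,2} = L_1^T (M_{1,1}^{-1})^T M_{2,1}^T = (M_{2,1} M_{1,1}^{-1} L_1)^T \qquad (6.10)$$

Thus, it suffices to compute $L_1$, $D_1$, $L_2$, $D_2$, and $M_{2,1} M_{1,1}^{-1} L_1$. ■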


When $n = 2^r$ and $k = n/2$, the proof of Theorem 6.5.2 describes a recursive procedure, LDLT[n], defined on $n \times n$ SPD matrices that produces their $LDL^T$ factorization. Figure 6.6 captures the steps involved. They are also described below.

• The $LDL^T$ factorization of the $n/2 \times n/2$ matrix $M_{1,1}$ is computed using the procedure LDLT[n/2] to produce the $n/2 \times n/2$ triangular and diagonal matrices $L_1$ and $D_1$, respectively.

• The product $M_{2,1} M_{1,1}^{-1} L_1 = M_{2,1} (L_1^{-1})^T D_1^{-1}$ is formed. It may be computed by inverting the lower triangular matrix $L_1$ with the operation TRI_INV[n/2], computing the product $M_{2,1} (L_1^{-1})^T$ using MULT[n/2], and multiplying the result with $D_1^{-1}$ using a procedure SCALE[n/2] that inverts $D_1$ and multiplies it by a square matrix.



Figure  6.6  An  algebraic  circuit  to  produce  the  LDLT  factorization  of  an  SPD  matrix. 


• $S = M_{2,2} - M_{2,1} M_{1,1}^{-1} M_{1,2}$ can be formed by multiplying $M_{2,1} (L_1^{-1})^T D_1^{-1}$ by the transpose of $M_{2,1} (L_1^{-1})^T$ using MULT[n/2] and subtracting the result from $M_{2,2}$ by the subtraction operator SUB[n/2].

• The $LDL^T$ factorization of the $n/2 \times n/2$ matrix $S$ is computed using the procedure LDLT[n/2] to produce the $n/2 \times n/2$ triangular and diagonal matrices $L_2$ and $D_2$, respectively.
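The four steps of LDLT[n] can be sketched as a sequential recursion. This is a minimal illustration under stated assumptions, not the circuit of Fig. 6.6: the function names are hypothetical, $n$ is assumed to be a power of 2, the input is assumed to be SPD, exact rationals stand in for field elements, and $D$ is returned as a list of its diagonal entries.

```python
from fractions import Fraction

def transpose(a):
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def tri_inv(t):
    """Recursive inverse of a lower triangular matrix (order a power of 2)."""
    n = len(t)
    if n == 1:
        return [[1 / Fraction(t[0][0])]]
    h = n // 2
    inv11 = tri_inv([row[:h] for row in t[:h]])
    inv22 = tri_inv([row[h:] for row in t[h:]])
    t21 = [row[:h] for row in t[h:]]
    low = [[-v for v in row] for row in matmul(matmul(inv22, t21), inv11)]
    return [r + [Fraction(0)] * h for r in inv11] + [l + r for l, r in zip(low, inv22)]

def ldlt(m):
    """Return (L, d) with M = L diag(d) L^T for an SPD matrix M."""
    n = len(m)
    if n == 1:
        return [[Fraction(1)]], [Fraction(m[0][0])]
    h = n // 2
    m11 = [row[:h] for row in m[:h]]
    m21 = [row[:h] for row in m[h:]]
    m22 = [row[h:] for row in m[h:]]
    l1, d1 = ldlt(m11)                               # step 1: LDLT[n/2] on M11
    b = matmul(m21, transpose(tri_inv(l1)))          # step 2: TRI_INV, then MULT
    c = [[v / d1[j] for j, v in enumerate(row)] for row in b]   # SCALE by D1^{-1}
    prod = matmul(c, transpose(b))                   # step 3: MULT ...
    s = [[m22[i][j] - prod[i][j] for j in range(h)] for i in range(h)]  # ... and SUB
    l2, d2 = ldlt(s)                                 # step 4: LDLT[n/2] on S
    lower = [r + [Fraction(0)] * h for r in l1] + [cr + lr for cr, lr in zip(c, l2)]
    return lower, d1 + d2

M = [[1, 2, 0, 1],
     [2, 6, 2, 2],
     [0, 2, 3, 3],
     [1, 2, 3, 13]]
L, d = ldlt(M)
```

The example matrix was built from a known unit lower triangular factor and positive diagonal, so the recursion can be checked by multiplying the returned factors back together.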

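Let's now determine the size and depth of circuits to implement the algorithm for LDLT[n]. Let $f^{(n)}_{LDL^T} : \mathcal{R}^{n^2} \mapsto \mathcal{R}^{(n^2+n)/2}$ be the function defined by the $LDL^T$ factorization of an $n \times n$ SPD matrix, $f^{(n)}_{\text{tri\_inv}} : \mathcal{R}^{(n^2+n)/2} \mapsto \mathcal{R}^{(n^2+n)/2}$ be the inversion of an $n \times n$ lower triangular matrix, $f^{(n)}_{\text{scale}} : \mathcal{R}^{n^2+n} \mapsto \mathcal{R}^{n^2}$ be the computation of $N(D^{-1})$ for an $n \times n$ matrix $N$ and a diagonal matrix $D$, $f^{(n)}_{\text{mult}} : \mathcal{R}^{2n^2} \mapsto \mathcal{R}^{n^2}$ be the multiplication of two $n \times n$ matrices, and $f^{(n)}_{\text{sub}} : \mathcal{R}^{2n^2} \mapsto \mathcal{R}^{n^2}$ the subtraction of two $n \times n$ matrices. Since a transposition can be done without any operators, the size and depth of the circuit for LDLT[n] constructed above satisfy the following inequalities:

$$C\big(f^{(n)}_{LDL^T}\big) \le C\big(f^{(n/2)}_{\text{tri\_inv}}\big) + C\big(f^{(n/2)}_{\text{scale}}\big) + 2C\big(f^{(n/2)}_{\text{mult}}\big) + C\big(f^{(n/2)}_{\text{sub}}\big) + 2C\big(f^{(n/2)}_{LDL^T}\big)$$

$$D\big(f^{(n)}_{LDL^T}\big) \le D\big(f^{(n/2)}_{\text{tri\_inv}}\big) + D\big(f^{(n/2)}_{\text{scale}}\big) + 2D\big(f^{(n/2)}_{\text{mult}}\big) + D\big(f^{(n/2)}_{\text{sub}}\big) + 2D\big(f^{(n/2)}_{LDL^T}\big)$$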

The size and depth of a circuit for $f^{(n)}_{\text{tri\_inv}}$ are $M_{\text{matrix}}(n, K)$ and $O(\log^2 n)$, as shown in Theorem 6.5.1. The circuits for $f^{(n)}_{\text{scale}}$ and $f^{(n)}_{\text{sub}}$ have size $n^2$ and depth 1; the former multiplies the elements of the $j$th column of $N$ by the multiplicative inverse of the $j$th diagonal element of $D$ for $1 \le j \le n$, while the latter subtracts corresponding elements from the two input matrices.

Let $C_{SPD}(n) = C\big(f^{(n)}_{LDL^T}\big)$ and $D_{SPD}(n) = D\big(f^{(n)}_{LDL^T}\big)$. Since $M_{\text{matrix}}(n/2, K) \le (1/4) M_{\text{matrix}}(n, K)$ is assumed (see Assumption 6.3.1), and $2m^2 \le M_{\text{matrix}}(m, K)$ (see Assumption 6.3.2), the above inequalities become

$$C_{SPD}(n) \le M_{\text{matrix}}(n/2, K) + (n/2)^2 + 2M_{\text{matrix}}(n/2, K) + (n/2)^2 + 2C_{SPD}(n/2)$$

$$\le 2C_{SPD}(n/2) + M_{\text{matrix}}(n, K) \qquad (6.11)$$

$$D_{SPD}(n) \le O(\log^2(n/2)) + 1 + 2\,O(\log(n/2)) + 1 + 2D_{SPD}(n/2)$$

$$\le 2D_{SPD}(n/2) + K \log^2 n \quad \text{for some } K > 0 \qquad (6.12)$$

As  a  consequence,  we  have  the  following  results. 

THEOREM 6.5.3 Let $n$ be a power of two. Then there exists a circuit to compute the $LDL^T$ factorization of an $n \times n$ matrix whose size and depth satisfy

$$C\big(f^{(n)}_{LDL^T}\big) \le 2M_{\text{matrix}}(n, K)$$

$$D\big(f^{(n)}_{LDL^T}\big) \le O(n \log^2 n)$$

Proof From (6.11) we have that

$$C_{SPD}(n) \le \sum_{j=0}^{\log n} 2^j M_{\text{matrix}}(n/2^j, K)$$

By Assumption 6.3.2, $M_{\text{matrix}}(n/2, K) \le (1/4) M_{\text{matrix}}(n, K)$. It follows by induction that $M_{\text{matrix}}(n/2^j, K) \le (1/4)^j M_{\text{matrix}}(n, K)$, which bounds the above sum by a geometric series whose sum is at most $2M_{\text{matrix}}(n, K)$. The bound on $D\big(f^{(n)}_{LDL^T}\big)$ follows from the observation that $(2c)(n/2)\log^2(n/2) + c \log^2 n \le cn \log^2 n$ for $n \ge 2$ and $c > 0$. ■

This  result  combined  with  earlier  observations  provides  a  matrix  inversion  algorithm  for 
arbitrary  non-singular  matrices. 
THEOREM 6.5.4 The matrix inverse function $f^{(n)}_{A^{-1}}$ for arbitrary non-singular $n \times n$ matrices over an arbitrary field $\mathcal{R}$ can be computed by an algebraic circuit whose size and depth satisfy the following bounds:

$$C\big(f^{(n)}_{A^{-1}}\big) = \Theta(M_{\text{matrix}}(n, K))$$

$$D\big(f^{(n)}_{A^{-1}}\big) = O(n \log^2 n)$$

Proof To invert a non-singular $n \times n$ matrix $M$ that is not SPD, form the product $P = M^T M$ (which is SPD) with one instance of MULT[n] and then invert it. Then multiply $P^{-1}$ by $M^T$ on the right with a second instance of MULT[n]. To invert $P$, compute its $LDL^T$ factorization and invert it by forming $(L^T)^{-1} D^{-1} L^{-1}$. Inverting $LDL^T$ requires one application of TRI_INV[n], one application of SCALE[n], and one application of MULT[n], in addition to the steps used to form the factorization. Thus, three applications of MULT[n] are used in addition to the factorization steps. The following bounds hold:

$$C\big(f^{(n)}_{A^{-1}}\big) \le 4M_{\text{matrix}}(n, K) + n^2 \le 4.5\,M_{\text{matrix}}(n)$$

$$D\big(f^{(n)}_{A^{-1}}\big) = O(n \log^2 n) + O(\log n) = O(n \log^2 n)$$

The lower bound on $C\big(f^{(n)}_{A^{-1}}\big)$ follows from Lemma 6.5.1. ■


6.5.5  Fast  Matrix  Inversion* 


In this section we present a depth-$O(\log^2 n)$ circuit for the inversion of $n \times n$ matrices known as Csanky's algorithm, which is based on the method of Leverrier. Since this algorithm uses a number of well-known matrix functions and properties that space precludes explaining in detail, advanced knowledge of matrices and polynomials is required for this section.

The determinant of an $n \times n$ matrix $A$, $\det(A)$, is defined below in terms of the set of all permutations $\pi$ of the integers $\{1, 2, \ldots, n\}$. Here the sign of $\pi$, denoted $\sigma(\pi)$, is the number of swaps of pairs of integers needed to realize $\pi$ from the identity permutation.

$$\det(A) = \sum_{\pi} (-1)^{\sigma(\pi)} \prod_{i=1}^{n} a_{i,\pi(i)}$$

Here $\prod_{i=1}^{n} a_{i,\pi(i)}$ is the product $a_{1,\pi(1)} \cdots a_{n,\pi(n)}$. The characteristic polynomial of a matrix $A$, namely, $\phi_A(x)$ in the variable $x$, is the determinant of $xI - A$, where $I$ is the $n \times n$ identity matrix:

$$\phi_A(x) = \det(xI - A) = x^n + c_{n-1}x^{n-1} + c_{n-2}x^{n-2} + \cdots + c_0$$

If $x$ is set to zero, this equation implies that $c_0 = \det(-A)$. Also, it can be shown that $\phi_A(A) = 0$, a fact known as the Cayley-Hamilton theorem: A matrix satisfies its own characteristic polynomial. This implies that

$$A\,(A^{n-1} + c_{n-1}A^{n-2} + c_{n-2}A^{n-3} + \cdots + c_1 I) = -c_0 I$$

Thus, when $c_0 \ne 0$ the inverse of $A$ can be computed from

$$A^{-1} = \frac{-1}{c_0}\,(A^{n-1} + c_{n-1}A^{n-2} + c_{n-2}A^{n-3} + \cdots + c_1 I)$$
Once the characteristic polynomial of $A$ has been computed, its inverse can be computed by forming the $n-1$ successive powers of $A$, namely, $A, A^2, A^3, \ldots, A^{n-1}$, multiplying them by the coefficients of $\phi_A(x)$, and adding the products together. These powers of $A$ can be computed using a prefix circuit having $O(n)$ instances of the associative matrix multiplication operator and depth $O(\log n)$ measured in the number of instances of this operator. We have defined $M_{\text{matrix}}(n, K)$ to be the size of the smallest $n \times n$ matrix multiplication circuit with depth $K \log n$ (Definition 6.3.1). Thus, the successive powers of $A$ can be computed by a circuit of size $O(n M_{\text{matrix}}(n, K))$ and depth $O(\log^2 n)$. The size bound can be improved to $O(\sqrt{n}\, M_{\text{matrix}}(n, K))$. (See Problem 6.15.)

To  complete  the  derivation  of  the  Csanky  algorithm  we  must  produce  the  coefficients  of 
the  characteristic  polynomial  of  A.  For  this  we  invoke  Leverrier’s  theorem.  This  theorem  uses 
the  notion  of  the  trace  of  a  matrix  A,  that  is,  the  sum  of  the  elements  on  its  main  diagonal, 
denoted  tr(A). 

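THEOREM 6.5.5 (Leverrier) The coefficients of the characteristic polynomial of the $n \times n$ matrix $A$ satisfy the following identity, where $s_r = \mathrm{tr}(A^r)$ for $1 \le r \le n$:

$$\begin{bmatrix} 1 & 0 & 0 & \cdots & 0 \\ s_1 & 2 & 0 & \cdots & 0 \\ s_2 & s_1 & 3 & \cdots & 0 \\ \vdots & & & \ddots & \vdots \\ s_{n-1} & \cdots & s_2 & s_1 & n \end{bmatrix} \begin{bmatrix} c_{n-1} \\ c_{n-2} \\ c_{n-3} \\ \vdots \\ c_0 \end{bmatrix} = - \begin{bmatrix} s_1 \\ s_2 \\ s_3 \\ \vdots \\ s_n \end{bmatrix} \qquad (6.13)$$
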


Proof The degree-$n$ characteristic polynomial $\phi_A(x)$ of $A$ can be factored over a field of characteristic zero. If $\lambda_1, \lambda_2, \ldots, \lambda_n$ are its roots, we write

$$\phi_A(x) = \prod_{j=1}^{n} (x - \lambda_j)$$

From expanding this expression, it is clear that the coefficient $c_{n-1}$ of $x^{n-1}$ is $-\sum_{j=1}^{n} \lambda_j$. Similarly, expanding $\det(xI - A)$, $c_{n-1}$ is the negative sum of the diagonal elements of $A$, that is, $c_{n-1} = -\mathrm{tr}(A)$. It follows that $\mathrm{tr}(A) = \sum_{j=1}^{n} \lambda_j$.

The $\lambda_j$'s are called the eigenvalues of $A$, that is, values such that there exists an $n$-vector $u$ (an eigenvector) such that $Au = \lambda_j u$. It follows that $A^r u = \lambda_j^r u$. It can be shown that $\lambda_1^r, \ldots, \lambda_n^r$ are precisely the eigenvalues of $A^r$, so $\prod_{j=1}^{n} (x - \lambda_j^r)$ is the characteristic polynomial of $A^r$. Since $s_r = \mathrm{tr}(A^r)$, $s_r = \sum_{j=1}^{n} \lambda_j^r$.

Let $s_0 = 1$ and $s_k = 0$ for $k < 0$. Then, to complete the proof of (6.13), we must show that the following identity holds for $1 \le i \le n$:

$$s_{i-1}c_{n-1} + s_{i-2}c_{n-2} + \cdots + s_1 c_{n-i+1} + i\,c_{n-i} = -s_i$$

Moving $s_i$ to the left-hand side, substituting for the traces, and using the definition of the characteristic polynomial yield

$$i\,c_{n-i} + \sum_{j=1}^{n} \frac{\phi_A(\lambda_j) - (\lambda_j^{n-i} c_{n-i} + \lambda_j^{n-i-1} c_{n-i-1} + \cdots + \lambda_j c_1 + c_0)}{\lambda_j^{n-i}} = 0$$
Since $\phi_A(\lambda_j) = 0$, when we substitute $l$ for $n - i$ it suffices to show the following for $0 \le l \le n - 1$:

$$(n - l)\,c_l = \sum_{j=1}^{n} \sum_{k=0}^{l} \frac{c_k}{\lambda_j^{\,l-k}} \qquad (6.14)$$



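This identity can be shown by induction using as the base case $l = 0$ and the following facts about the derivatives of the characteristic polynomial of $A$, which are easy to establish:

$$c_0 = (-1)^n \prod_{j=1}^{n} \lambda_j$$

$$c_k = \frac{1}{k!} \left. \frac{d^k \phi_A(x)}{dx^k} \right|_{x=0} = (-1)^k \frac{c_0}{k!} \sum_{j_1} \cdots \sum_{j_k} \prod_{t=1}^{k} \frac{1}{\lambda_{j_t}}, \qquad j_r \ne j_s \text{ for } r \ne s$$

The reader is asked to show that (6.14) follows from these identities. (See Problem 6.17.)
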


Csanky's algorithm computes the traces of powers, namely the $s_r$'s, and then inverts the lower triangular matrix given above, thereby solving for the coefficients of the characteristic polynomial. The coefficients are then used with a prefix computation, as mentioned earlier, to compute the inverse. Each of the $n$ $s_r$'s can be computed in $O(n)$ steps once the powers of $A$ have been formed by the prefix computation described above. The lower triangular matrix is non-singular and can be inverted by a circuit with $M_{\text{matrix}}(n, K)$ operations and depth $O(\log^2 n)$, as shown in Theorem 6.5.1. The following theorem summarizes these results.

THEOREM 6.5.6 The matrix inverse function for non-singular $n \times n$ matrices over a field of characteristic zero, $f^{(n)}_{A^{-1}}$, has an algebraic circuit whose size and depth satisfy the following bounds:

$$C\big(f^{(n)}_{A^{-1}}\big) = O(n M_{\text{matrix}}(n, K))$$

$$D\big(f^{(n)}_{A^{-1}}\big) = O(\log^2 n)$$

The size bound can be improved to $O(\sqrt{n}\, M_{\text{matrix}}(n, K))$, as suggested in Problems 6.15 and 6.16.
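The arithmetic behind Csanky's algorithm can be traced with a sequential sketch. It follows the same three stages (powers and traces, the triangular system (6.13), the Cayley-Hamilton formula) but makes no attempt to reproduce the parallel circuit; the powers are formed by a plain loop instead of a prefix computation, and the triangular system is solved by forward elimination instead of TRI_INV. The function names are hypothetical, and rationals are used because the method divides by $1, 2, \ldots, n$.

```python
from fractions import Fraction

def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def csanky_inverse(a):
    n = len(a)
    ident = [[Fraction(int(i == j)) for j in range(n)] for i in range(n)]
    # powers[r] = A^r for 0 <= r <= n
    powers = [ident]
    for _ in range(n):
        powers.append(matmul(powers[-1], a))
    # s[r] = tr(A^r) for 1 <= r <= n (s[0] is unused)
    s = [None] + [sum(powers[r][i][i] for i in range(n)) for r in range(1, n + 1)]
    # Row i of (6.13): s_{i-1} c_{n-1} + ... + s_1 c_{n-i+1} + i c_{n-i} = -s_i
    c = [None] * n                      # c[k] is the coefficient of x^k
    for i in range(1, n + 1):
        acc = sum(s[i - j] * c[n - j] for j in range(1, i))
        c[n - i] = (-s[i] - acc) / i
    if c[0] == 0:
        raise ValueError("matrix is singular")
    # A^{-1} = -(1/c_0) (A^{n-1} + c_{n-1} A^{n-2} + ... + c_1 I)
    coeffs = c + [Fraction(1)]          # coeffs[k] multiplies A^{k-1}, k = 1..n
    total = [[sum(coeffs[k] * powers[k - 1][i][j] for k in range(1, n + 1))
              for j in range(n)] for i in range(n)]
    return [[-v / c[0] for v in row] for row in total]

A = [[2, 1], [1, 1]]
A_inv = csanky_inverse(A)
```

For this $A$ the characteristic polynomial is $x^2 - 3x + 1$, so the formula reduces to $A^{-1} = -(A - 3I)$.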


6.6  Solving  Linear  Systems 

A general linear system with $n \times n$ coefficient matrix $M$, $n$-vector $x$ of unknowns and $n$-vector $b$ is defined in (6.5) and repeated below:

$$Mx = b$$

This system can be solved for $x$ in terms of $M$ and $b$ using the following steps when $M$ is not SPD. If it is SPD, the first step is unnecessary and can be eliminated.

a) Premultiply both sides by the transpose of $M$ to produce the following linear system in which the coefficient matrix $M^T M$ is SPD:

$$M^T M x = M^T b = b^*$$
b) Compute the $LDL^T$ decomposition of $M^T M$.

c) Solve the system (6.15) by solving three subsequent systems:

$$LDL^T x = b^* \qquad (6.15)$$

$$Lu = b^* \qquad (6.16)$$

$$Dv = u \qquad (6.17)$$

$$L^T x = v \qquad (6.18)$$

Clearly, $Lu = LDv = LDL^T x = b^*$.

The vector $b^*$ is formed by a matrix-vector multiplication that can be done with $n^2$ multiplications and $n(n-1)$ additions, for a total of $2n^2 - n$ operations.

Since $L$ is unit lower triangular, the system (6.16) is solved by forward elimination. The value of $u_1$ is $b^*_1$. The value of $u_2$ is $b^*_2 - l_{2,1}u_1$, obtained by eliminating $u_1$ from the second equation. Similarly, on the $j$th step, the values of $u_1, u_2, \ldots, u_{j-1}$ are known and their weighted values can be subtracted from $b^*_j$ to provide the value of $u_j$; that is,

$$u_j = b^*_j - l_{j,1}u_1 - l_{j,2}u_2 - \cdots - l_{j,j-1}u_{j-1}$$

for $1 \le j \le n$. Here $n(n-1)/2$ products are formed and $n(n-1)/2$ subtractions taken for a total of $n(n-1)$ operations.

Since $D$ is diagonal, the system (6.17) is solved for $v$ by multiplying $u_j$ by the multiplicative inverse of $d_{j,j}$; that is,

$$v_j = u_j d_{j,j}^{-1}$$

for $1 \le j \le n$. This is called normalization. Here $n$ divisions are performed.

Finally, the system (6.18) is solved for $x$ by backward substitution, which is forward elimination applied to the elements of $x$ in reverse order.
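The three solves (6.16) to (6.18) can be sketched directly, assuming the factors are already available. The function name `solve_ldlt` is hypothetical; `L` is assumed to be unit lower triangular and `d` a list holding the diagonal of $D$.

```python
from fractions import Fraction

def solve_ldlt(L, d, b):
    """Solve L D L^T x = b by forward elimination, normalization, back substitution."""
    n = len(b)
    # (6.16) forward elimination: L u = b, using the unit diagonal of L
    u = []
    for j in range(n):
        u.append(Fraction(b[j]) - sum(L[j][k] * u[k] for k in range(j)))
    # (6.17) normalization: D v = u
    v = [u[j] / d[j] for j in range(n)]
    # (6.18) backward substitution: L^T x = v, processed in reverse order
    x = [Fraction(0)] * n
    for j in reversed(range(n)):
        x[j] = v[j] - sum(L[k][j] * x[k] for k in range(j + 1, n))
    return x

# L D L^T = [[1, 2], [2, 6]] for these factors
L = [[1, 0], [2, 1]]
d = [1, 2]
x = solve_ldlt(L, d, [5, 14])
```

Each of the two triangular solves touches every entry below (or above) the diagonal once, which matches the $n(n-1)$ operation count given above.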

THEOREM 6.6.1 Let $f^{(n)}_{\text{SPD\_solve}} : \mathcal{R}^{n^2+n} \mapsto \mathcal{R}^n$ be the (partial) function that computes the solution to a linear system of equations defined by an $n \times n$ symmetric positive definite coefficient matrix $M$. Then

$$C\big(f^{(n)}_{\text{SPD\_solve}}\big) \le C\big(f^{(n)}_{LDL^T}\big) + O(n^2)$$

$$D\big(f^{(n)}_{\text{SPD\_solve}}\big) \le D\big(f^{(n)}_{LDL^T}\big) + O(n)$$

If $M$ is not SPD but is non-singular, an additional $O(M_{\text{matrix}}(n, K))$ circuit elements and depth $O(\log n)$ suffice to compute it.

6.7  Convolution  and  the  FFT  Algorithm 

The  discrete  Fourier  transform  (DFT)  and  convolution  are  widely  used  techniques  with  im¬ 
portant  applications  in  signal  processing  and  computer  science. 

In this section we introduce the DFT, describe the fast Fourier transform algorithm, and derive the convolution theorem. The naive DFT algorithm on sequences of length $n$ uses $O(n^2)$ operations; the fast Fourier transform algorithm uses only $O(n \log n)$ operations, a saving of a factor of at least 100 for $n \ge 1{,}000$. The convolution theorem provides a way to use the DFT to convolve two sequences in $O(n \log n)$ steps, many fewer than the naive algorithm for convolution, which uses $O(n^2)$ steps.
6.7.1 Commutative Rings*

Since  the  DFT  is  defined  over  commutative  rings  having  an  nth  root  of  unity,  we  digress  briefly 
to  discuss  such  rings.  (Commutative  rings  are  defined  in  Section  6.2.) 

DEFINITION 6.7.1 A commutative ring $\mathcal{R} = (R, +, *, 0, 1)$ has a principal $n$th root of unity $\omega$ if $\omega \in R$ satisfies the following conditions:

$$\omega^n = 1 \qquad (6.19)$$

$$\sum_{k=0}^{n-1} \omega^{lk} = 0 \quad \text{for each } 1 \le l \le n - 1 \qquad (6.20)$$

The elements $\omega^0, \omega^1, \omega^2, \ldots, \omega^{n-1}$ are the $n$th roots of unity and the elements $\omega^0, \omega^{-1}, \omega^{-2}, \ldots, \omega^{-(n-1)}$ are the $n$th inverse roots of unity. (Note that $\omega^{-j} = \omega^{n-j}$ is the multiplicative inverse of $\omega^j$ since $\omega^j \omega^{n-j} = \omega^n = 1$.)

Two commutative rings that have principal $n$th roots of unity are the complex numbers and the ring $\mathbb{Z}_m$ of integers modulo $m = 2^{tn/2} + 1$ when $t \ge 2$ and $n = 2^q$, as we show. The reader is asked to show that $\mathbb{Z}_m$ has a principal $n$th root of unity, as stated below. (See Problem 6.24.)

LEMMA 6.7.1 Let $\mathbb{Z}_m$ be the ring of integers modulo $m$ when $m = 2^{tn/2} + 1$, $t \ge 2$ and $n = 2^q$. Then $\omega = 2^t$ is a principal $n$th root of unity.

An example of the ring $\mathbb{Z}_m$ is given by $t = 2$, $n = 4$, and $m = 2^4 + 1 = 17$. In this ring $\omega = 4$ is a principal fourth root of unity. This is true because $\omega^n = 4^4 = 16 \cdot 16 = (16 + 1)(16 - 1) + 1 = 1 \bmod (16 + 1)$ and $\sum_{j=0}^{n-1} \omega^{pj} = ((4^p)^n - 1)/(4^p - 1) \bmod 17 = ((4^n)^p - 1)/(4^p - 1) \bmod 17 = (1^p - 1)/(4^p - 1) \bmod 17 = 0 \bmod 17$.
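Conditions (6.19) and (6.20) are easy to check by brute force in $\mathbb{Z}_m$. The sketch below does so for the example above; the function name `is_principal_root` is hypothetical, and the check costs $O(n^2)$ modular operations, so it is meant only for small $n$.

```python
def is_principal_root(omega, n, m):
    """Check (6.19) and (6.20) for omega in the ring of integers modulo m."""
    if pow(omega, n, m) != 1:                      # (6.19): omega^n = 1
        return False
    # (6.20): the sum over k of omega^(l*k) vanishes for each 1 <= l <= n-1
    return all(sum(pow(omega, l * k, m) for k in range(n)) % m == 0
               for l in range(1, n))

ok = is_principal_root(4, 4, 17)        # the example with t = 2, n = 4, m = 17
not_ok = is_principal_root(16, 4, 17)   # 16^4 = 1 mod 17, but (6.20) fails for l = 2
```

The second call shows why the word "principal" matters: 16 is a fourth root of unity modulo 17, yet its even powers all equal 1, so the sum in (6.20) is 4 rather than 0.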

LEMMA 6.7.2 $e^{2\pi i/n} = \cos(2\pi/n) + i \sin(2\pi/n)$ is a principal $n$th root of unity over the complex numbers where $i = \sqrt{-1}$ is the "imaginary unit."

Proof The first condition is satisfied because $(e^{2\pi i/n})^n = e^{2\pi i} = 1$. Also, $\sum_{k=0}^{n-1} \omega^{lk} = (\omega^{ln} - 1)/(\omega^l - 1) = 0$ if $1 \le l \le n - 1$ for $\omega = e^{2\pi i/n}$. ■

6.7.2  The  Discrete  Fourier  Transform 

The  discrete  Fourier  transform  has  many  applications.  In  Section  6.7.4  we  see  that  it  can  be 
used  to  compute  the  convolution  of  two  sequences  efficiently,  which  is  the  same  as  computing 
the  coefficients  of  the  product  of  two  polynomials.  The  discrete  Fourier  transform  can  also  be 
used  to  construct  a  fast  algorithm  (circuit)  for  the  multiplication  of  two  binary  integers  [303]. 
It  is  widely  used  in  processing  analog  data  such  as  speech  and  music. 

The $n$-point discrete Fourier transform $F_n : R^n \mapsto R^n$ maps $n$-tuples $a = (a_0, a_1, \ldots, a_{n-1})$ over $R$ to $n$-tuples $f = (f_0, f_1, \ldots, f_{n-1})$ over $R$; that is, $F_n(a) = f$. The components of $f$ are defined as the values of the following polynomial $p(x)$ at the $n$th roots of unity:

$$p(x) = a_0 + a_1 x + a_2 x^2 + \cdots + a_{n-1} x^{n-1} \qquad (6.21)$$
Then $f_r$, the $r$th component of $F_n(a)$, is defined as

$$f_r = p(\omega^r) = \sum_{k=0}^{n-1} a_k \omega^{rk} \qquad (6.22)$$

This computation is equivalent to the following matrix-vector multiplication:

$$F_n(a) = [\omega^{ij}] \times a \qquad (6.23)$$

where $[\omega^{ij}]$ is the $n \times n$ Vandermonde matrix whose $i, j$ entry is $\omega^{ij}$, $0 \le i, j \le n - 1$, and $a$ is treated as a column vector.

The $n$-point inverse discrete Fourier transform $F_n^{-1} : R^n \mapsto R^n$ is defined as the values of the following polynomial $q(x)$ at the inverse $n$th roots of unity:

$$q(x) = (f_0 + f_1 x + f_2 x^2 + \cdots + f_{n-1} x^{n-1})/n \qquad (6.24)$$

That is, the inverse DFT maps an $n$-tuple $f$ to an $n$-tuple $g$, namely, $F_n^{-1}(f) = g$, where $g_s$ is defined as follows:

$$g_s = q(\omega^{-s}) = \frac{1}{n} \sum_{l=0}^{n-1} f_l\, \omega^{-ls}$$

This computation is equivalent to the following matrix-vector multiplication:

$$F_n^{-1}(f) = \left[ \frac{1}{n}\, \omega^{-ij} \right] \times f$$

Because of the following lemma it is legitimate to call $F_n^{-1}$ the inverse of $F_n$.

LEMMA 6.7.3 For all $a \in R^n$, $a = F_n^{-1}(F_n(a))$.

Proof Let $f = F_n(a)$ and $g = F_n^{-1}(f)$. Then $g_s$ satisfies the following:

$$g_s = \frac{1}{n} \sum_{l=0}^{n-1} f_l\, \omega^{-ls} = \frac{1}{n} \sum_{l=0}^{n-1} \sum_{k=0}^{n-1} a_k\, \omega^{(k-s)l} = \sum_{k=0}^{n-1} a_k\, \frac{1}{n} \sum_{l=0}^{n-1} \omega^{(k-s)l} = a_s \qquad (6.25)$$

The second equation results from a change in the order of summation. The last follows from the definition of $n$th roots of unity. It follows that the matrix $[\frac{1}{n}\omega^{-ij}]$ is the inverse of $[\omega^{ij}]$. ■

The computation of the $n$-point DFT and its inverse using the naive algorithms suggested by their definitions requires $O(n^2)$ steps. Below we show that a fast DFT algorithm exists for which only $O(n \log n)$ steps suffice.
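The naive $O(n^2)$ algorithms follow directly from (6.22) and the definition of $g_s$. The sketch below works in $\mathbb{Z}_m$ so that the arithmetic is exact; the function names are hypothetical, `omega` is assumed to be a principal $n$th root of unity modulo `m`, $n$ is assumed to be invertible modulo `m`, and the three-argument `pow` with exponent `-1` assumes Python 3.8 or later.

```python
def dft(a, omega, m):
    """Naive n-point DFT over the integers modulo m: f_r = sum_k a_k omega^(r k)."""
    n = len(a)
    return [sum(a[k] * pow(omega, r * k, m) for k in range(n)) % m
            for r in range(n)]

def inverse_dft(f, omega, m):
    """Naive inverse DFT: g_s = (1/n) sum_l f_l omega^(-l s)."""
    n = len(f)
    omega_inv = pow(omega, n - 1, m)     # omega^{-1} = omega^{n-1} since omega^n = 1
    n_inv = pow(n, -1, m)                # multiplicative inverse of n modulo m
    return [n_inv * sum(f[l] * pow(omega_inv, l * s, m) for l in range(n)) % m
            for s in range(n)]

a = [1, 2, 3, 4]
f = dft(a, 4, 17)                        # omega = 4 is a principal 4th root mod 17
g = inverse_dft(f, 4, 17)
```

Applying `inverse_dft` to the output of `dft` should return the original sequence, which is the content of Lemma 6.7.3.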
6.7.3 Fast Fourier Transform

The fast Fourier transform algorithm is a consequence of the following observation: when $n$ is even, the polynomial $p(x)$ in equation (6.21) can be decomposed as

$$p(x) = a_0 + a_1 x + a_2 x^2 + \cdots + a_{n-1} x^{n-1}$$

$$= (a_0 + a_2 x^2 + \cdots + a_{n-2} x^{n-2}) + x\,(a_1 + a_3 x^2 + \cdots + a_{n-1} x^{n-2})$$

$$= p_e(x^2) + x\,p_o(x^2) \qquad (6.26)$$

Here $p_e(y)$ and $p_o(y)$ are polynomials of degree $(n/2) - 1$.

Let $n$ be a power of 2, that is, $n = 2^d$. As stated above, the $n$-point DFT of $a$ is obtained by evaluating $p(x)$ at the $n$th roots of unity. Because of the decomposition of $p(x)$, it suffices to evaluate $p_e(y)$ and $p_o(y)$ at $y = (\omega^0)^2, (\omega^1)^2, (\omega^2)^2, \ldots, (\omega^{n-1})^2 = (\omega^2)^0, (\omega^2)^1, (\omega^2)^2, \ldots, (\omega^2)^{n-1}$ and combine their values with one multiplication and one addition for each of the $n$ roots of unity. However, because $\omega^2$ is a $(n/2)$th principal root of unity (see Problem 6.25), $(\omega^2)^{(n/2)+r} = (\omega^2)^r$ and the $n$ powers of $\omega^2$ collapse to $n/2$ distinct powers of $\omega^2$, namely, the $(n/2)$th roots of unity. Thus, $p(x)$ at the $n$th roots of unity can be evaluated by evaluating $p_e(y)$ and $p_o(y)$ at the $(n/2)$th roots of unity and combining their values with one addition and multiplication for each of the $n$th roots of unity. In other words, the $n$-point DFT of $a$ can be done by performing the $(n/2)$-point DFT of its even and odd subsequences and combining the results with $O(n)$ additional steps. This is the fast Fourier transform (FFT) algorithm.
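The even/odd decomposition translates into a short recursive sketch. As before it works in $\mathbb{Z}_m$ for exactness; the function name `fft` and its argument order are choices made here for illustration, the length of `a` is assumed to be a power of 2, and `omega` is assumed to be a principal root of unity of that order modulo `m`.

```python
def fft(a, omega, m):
    """Recursive n-point FFT over the integers modulo m, n a power of 2."""
    n = len(a)
    if n == 1:
        return [a[0] % m]
    omega_sq = omega * omega % m            # a principal (n/2)th root of unity
    even = fft(a[0::2], omega_sq, m)        # p_e at the (n/2)th roots of unity
    odd = fft(a[1::2], omega_sq, m)         # p_o at the (n/2)th roots of unity
    half = n // 2
    out = [0] * n
    w = 1                                   # w = omega^r
    for r in range(n):
        # f_r = p_e((omega^2)^r) + omega^r p_o((omega^2)^r), with powers taken mod n/2
        out[r] = (even[r % half] + w * odd[r % half]) % m
        w = w * omega % m
    return out

f = fft([1, 2, 3, 4], 4, 17)
```

Each level of the recursion performs one multiplication and one addition per output, and there are $\log_2 n$ levels, which is where the $O(n \log n)$ step count comes from.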

We denote by $F^{(d)}$ the directed acyclic graph associated with the straight-line program resulting from this realization of the FFT on $n = 2^d$ inputs. A circuit for the FFT algorithm on 16 inputs, $F^{(4)}$, is shown in Fig. 6.7. It is computed from the eight-point FFT on the even and odd components of $a$, as shown in the boxed regions. These components are permuted because each of these smaller FFTs is computed recursively in turn. (The index of the $i$th input vertex from the left is obtained by writing the integer $i$ as a binary number, reversing the bits, and converting the resulting binary number to an integer. This is called the bit-reverse permutation of the binary representation of the integer. For example, the third input from the left, counting from zero, has index 3, which is (0011) in binary. Reversed, the binary number is (1100), which represents 12.) Inputs are associated with the open vertices at the bottom of the graph. Each vertex except for input vertices is associated with an addition and a multiplication. For example, the white vertex at the top of the graph computes $f_8 = p_e((\omega^8)^2) + \omega^8 p_o((\omega^8)^2)$, where $(\omega^8)^2 = \omega^{16} = \omega^0$.

Figure 6.7 A circuit $F^{(4)}$ for the FFT algorithm on 16 inputs. Outputs, left to right: $f_0, f_1, \ldots, f_{15}$. Inputs, left to right: $a_0, a_8, a_4, a_{12}, a_2, a_{10}, a_6, a_{14}, a_1, a_9, a_5, a_{13}, a_3, a_{11}, a_7, a_{15}$.
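The bit-reverse permutation is simple to compute. The sketch below (the function name `bit_reverse` is hypothetical) reverses a fixed number of bits and lists the resulting input order for 16 inputs, which can be compared with the input labels along the bottom of Fig. 6.7.

```python
def bit_reverse(i, bits):
    """Reverse the low `bits` bits of the non-negative integer i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)   # append the lowest bit of i to r
        i >>= 1
    return r

# Index of the input found at each position, left to right, for 16 inputs.
order = [bit_reverse(i, 4) for i in range(16)]
```

Because reversing the bits twice restores the original number, the permutation is its own inverse.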

Let $C(F^{(d)})$ and $D(F^{(d)})$ be the size and depth of circuits for the $2^d$-point FFT algorithm for integer $d \ge 1$. The construction given above leads to the following recurrences for these two measures:

$$C\big(F^{(d)}\big) \le 2C\big(F^{(d-1)}\big) + 2^{d+1}$$

$$D\big(F^{(d)}\big) \le D\big(F^{(d-1)}\big) + 2$$

Also, examination of the base case of $n = 2$ demonstrates that $C\big(F^{(1)}\big) = 3$ and $D\big(F^{(1)}\big) = 2$, from which we have the following theorem.

THEOREM 6.7.1 Let $n = 2^d$. The circuit for the $n$-point FFT algorithm over a commutative ring $\mathcal{R}$ has the following circuit size and depth bounds:

$$C\big(F^{(d)}\big) \le 2n \log n$$

$$D\big(F^{(d)}\big) \le 2 \log n$$
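One way to see how the recurrences give these bounds is to unroll them down to the base case $d = 1$. Each of the $d - 1$ levels of the size recurrence contributes $2^{d+1}$, so

$$C\big(F^{(d)}\big) \le 2^{d-1} C\big(F^{(1)}\big) + (d - 1)\,2^{d+1} = 3 \cdot 2^{d-1} + (d - 1)\,2^{d+1} = \big(2d - \tfrac{1}{2}\big)\,2^d \le 2d\,2^d = 2n \log n$$

$$D\big(F^{(d)}\big) \le D\big(F^{(1)}\big) + 2(d - 1) = 2d = 2 \log n$$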

The  FFT  graph  is  used  in  later  chapters  to  illustrate  tradeoffs  between  space  and  time,  space 
and  the  number  of  I/O  operations,  and  area  and  time  for  computation  with  VLSI  machines. 
For  each  of  these  applications  we  decompose  the  FFT  graph  into  sub-FFT  graphs.  One  such 
decomposition  is  shown  in  Fig.  6.7.  A  more  general  decomposition  is  shown  in  Fig.  6.8  and 
described  below. 

LEMMA 6.7.4 The $2^d$-point FFT graph $F^{(d)}$ can be decomposed into $2^e$ $2^{d-e}$-point bottom FFT graphs, $\{F^{(d-e)}_{b,j} \mid 1 \le j \le 2^e\}$, and $2^{d-e}$ $2^e$-point top FFT graphs, $\{F^{(e)}_{t,j} \mid 1 \le j \le 2^{d-e}\}$. The $i$th input of $F^{(e)}_{t,j}$ is the $j$th output of $F^{(d-e)}_{b,i}$.

In  Fig.  6.8  the  vertices  and  edges  have  been  grouped  together  as  recognizable  FFT  graphs 
and  surrounded  by  shaded  boxes.  The  edges  between  boxes  are  not  edges  of  the  FFT  graph  but 
instead  are  used  to  identify  vertices  that  are  simultaneously  outputs  of  bottom  FFT  subgraphs 
and  inputs  to  top  FFT  subgraphs. 

COROLLARY 6.7.1 $F^{(d)}$ can be decomposed into $\lfloor d/e \rfloor$ stages each containing $2^{d-e}$ copies of
$F^{(e)}$ and one stage containing $2^{d-k}$ copies of $F^{(k)}$, $k = d - \lfloor d/e \rfloor e$. ($F^{(0)}$ is a single vertex.)
The output vertices of one stage are the input vertices to the next.

Proof From Lemma 6.7.4, each of the $2^e$ bottom FFT subgraphs $F^{(d-e)}$ can be further
decomposed into $2^{d-2e}$ top FFT subgraphs $F^{(e)}$ and $2^e$ bottom FFT subgraphs $F^{(d-2e)}$.
By repeating this process t times, $t \leq d/e$, $F^{(d)}$ can be decomposed into t stages each
containing $2^{d-e}$ copies of $F^{(e)}$ and one stage containing $2^{te}$ copies of $F^{(d-te)}$. The
result follows by setting $t = \lfloor d/e \rfloor$. ■

Figure 6.8 Decomposition of the 32-point FFT graph $F^{(5)}$ into four copies of $F^{(3)}$ and 8
copies of $F^{(2)}$. The edges between bottom and top sub-FFT graphs do not exist in the FFT
graph. They are used here to identify common vertices and highlight the communication needed
among sub-FFT graphs.

6.7.4  Convolution  Theorem 

The convolution function $f_{conv}^{(n,m)} : R^{n+m} \mapsto R^{n+m-1}$ over a commutative ring $\mathcal{R}$ maps an
n-tuple $a = (a_0, a_1, \ldots, a_{n-1})$ and an m-tuple $b = (b_0, b_1, \ldots, b_{m-1})$ onto an $(n+m-1)$-tuple
c, denoted $c = a \otimes b$, where $c_j$ is defined as follows:

$$c_j = \sum_{r+s=j} a_r * b_s \quad \text{for } 0 \leq j \leq n+m-2$$

Here $\sum$ and $*$ are addition and multiplication over the ring $\mathcal{R}$. The direct computation of the
convolution function using the above formula takes O(nm) steps. The convolution theorem
given below and the fast Fourier transform algorithm described above allow the convolution
function to be computed in O(n log n) steps when n = m.
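The direct O(nm) computation can be written as a double loop over the defining sum. The sketch below assumes the ring elements are ordinary Python numbers; the name `convolve_direct` is hypothetical.

```python
def convolve_direct(a, b):
    """Convolution of an n-tuple a and an m-tuple b straight from the definition."""
    n, m = len(a), len(b)
    c = [0] * (n + m - 1)          # c_j for 0 <= j <= n + m - 2
    for r in range(n):
        for s in range(m):
            c[r + s] += a[r] * b[s]  # a_r * b_s contributes to c_j with j = r + s
    return c
```

For example, `convolve_direct([1, 2, 3], [4, 5])` yields `[4, 13, 22, 15]`, the coefficients of $(1 + 2x + 3x^2)(4 + 5x)$.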

Associate with a and b the following polynomials in the variable x:

$$a(x) = a_0 + a_1 x + a_2 x^2 + \cdots + a_{n-1} x^{n-1}$$
$$b(x) = b_0 + b_1 x + b_2 x^2 + \cdots + b_{n-1} x^{n-1}$$

Then the coefficient of the term $x^j$ in the product polynomial c(x) = a(x)b(x) is clearly the
term $c_j$ in the convolution $c = a \otimes b$.

Convolution is used in signal processing and integer multiplication. In signal processing,
convolution describes the results of passing a signal through a linear filter. In binary integer
multiplication the polynomials a(2) and b(2) represent binary numbers; convolution is related
to the computation of their product.

The convolution theorem is one of the most important applications of the DFT. It
demonstrates that convolution, which appears to require $O(n^2)$ operations when n = m,
can in fact be computed by a circuit with O(n) operations plus a small multiple of the number
needed to compute the DFT and its inverse.

THEOREM 6.7.2 Let $\mathcal{R} = (R, +, *, 0, 1)$ be a commutative ring and let $a, b \in R^n$. Let
$F_{2n} : R^{2n} \mapsto R^{2n}$ and $F_{2n}^{-1} : R^{2n} \mapsto R^{2n}$ be the 2n-point DFT and its inverse over $\mathcal{R}$. Let
$F_{2n}(a) \times F_{2n}(b)$ denote the 2n-tuple obtained from the term-by-term product of the components
of $F_{2n}(a)$ and $F_{2n}(b)$. Then, the convolution $a \otimes b$ satisfies the following identity:

$$a \otimes b = F_{2n}^{-1}\left(F_{2n}(a) \times F_{2n}(b)\right)$$

Proof The n-point DFT $F_n : R^n \mapsto R^n$ transforms the n-tuple of coefficients a of the
polynomial p(x) of degree $n - 1$ into the n-tuple $f = F_n(a)$. In fact, the rth component
of f, $f_r$, is the value of the polynomial p(x) at the rth of the n roots of unity, namely
$f_r = p(\omega^r)$. The n-point inverse DFT $F_n^{-1} : R^n \mapsto R^n$ inverts the process through a
similar computation. If q(x) is the polynomial of degree $n - 1$ whose lth coefficient is $f_l/n$,
then the sth component of the inverse DFT on f, namely $a_s$, is $a_s = q(\omega^{-s})$.

As stated above, to compute the convolution of n-tuples a and b it suffices to compute
the coefficients of the product polynomial c(x) = a(x)b(x). Since the product c(x) is of
degree $2n - 2$, we can treat it as a polynomial of degree $2n - 1$ and take the 2n-point
DFT, $F_{2n}$, of it and its inverse, $F_{2n}^{-1}$, of the result. This seemingly futile process leads to an
efficient algorithm for convolution. Since the DFT is obtained by evaluating a polynomial
at the n roots of unity, the DFT of c(x) can be done at the 2n roots of unity by evaluating
a(x) and b(x) at the 2n roots of unity (that is, computing the DFTs of their coefficients
as if they had degree $2n - 1$), multiplying their values together, and taking the 2n-point
inverse DFT, that is, performing the computation stated in the theorem. ■

Figure 6.9 The DAG associated with the straight-line program resulting from the application
of the FFT to the convolution theorem with sequences of length 8.
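The identity of Theorem 6.7.2 can be exercised with exact arithmetic. The sketch below is an illustration under several assumptions: the inputs are nonnegative integers, n is a power of two, and the ring is the integers modulo $m = 2^{tN/2} + 1$ with $\omega = 2^t$ and N = 2n (the ring considered in Problems 6.23 and 6.24). It uses a naive $O(N^2)$ DFT in place of the FFT to keep the code short, so it demonstrates the identity and not the fast algorithm. The names `dft` and `convolve_via_dft` are hypothetical.

```python
def dft(values, omega, m):
    """Naive N-point DFT over the integers modulo m: f_r = sum_l values[l] * omega^(r*l)."""
    N = len(values)
    return [sum(v * pow(omega, r * l, m) for l, v in enumerate(values)) % m
            for r in range(N)]


def convolve_via_dft(a, b):
    """Convolution of two n-tuples of nonnegative integers via 2n-point DFTs modulo m."""
    n = len(a)
    assert len(b) == n and n > 0 and n & (n - 1) == 0, "n must be a power of two"
    N = 2 * n
    bound = n * max(a) * max(b)           # no coefficient of the result exceeds this
    t = 1
    while 2 ** (t * N // 2) + 1 <= bound:  # pick m larger than every coefficient
        t += 1
    m = 2 ** (t * N // 2) + 1
    omega = 2 ** t                        # omega^(N/2) = -1 mod m, so omega^N = 1
    fa = dft(a + [0] * n, omega, m)       # pad to length 2n, then transform
    fb = dft(b + [0] * n, omega, m)
    prod = [(x * y) % m for x, y in zip(fa, fb)]
    omega_inv = pow(omega, N - 1, m)      # inverse of omega, since omega^N = 1
    n_inv = pow(N, -1, m)                 # modular inverse; needs Python 3.8 or later
    c = [(n_inv * v) % m for v in dft(prod, omega_inv, m)]
    return c[:N - 1]                      # the last of the 2n components is zero
```

For a = (1, 2, 3, 4) and b = (5, 6, 7, 8) the loop selects t = 2, so m = 257 and ω = 4, and the function returns (5, 16, 34, 60, 61, 52, 32).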

The  combination  of  the  convolution  theorem  and  the  algorithm  for  the  FFT  provides  a 
fast  straight-line  program  for  convolution,  as  stated  below.  The  directed  acyclic  graph  for  this 
straight-line program is shown in Fig. 6.9.

THEOREM 6.7.3 Let $n = 2^d$. The convolution function $f_{conv}^{(n,n)} : R^{2n} \mapsto R^{2n-1}$ over a
commutative ring $\mathcal{R}$ can be computed by a straight-line program over $\mathcal{R}$ with size and depth
satisfying the following bounds:

$$C(f_{conv}^{(n,n)}) \leq 12n \log 2n$$
$$D(f_{conv}^{(n,n)}) \leq 4 \log 2n$$

6.8  Merging  and  Sorting  Networks 

The  sorting  problem  is  to  put  into  ascending  or  descending  order  a  collection  of  items  that 
are  drawn  from  a  totally  ordered  set.  A  set  is  totally  ordered  if  for  every  two  distinct  elements 
of  the  set  one  is  larger  than  the  other.  The  merging  problem  is  to  merge  two  sorted  lists  into 
one  sorted  list.  Sorting  and  merging  algorithms  can  be  either  straight-line  or  non-straight-line. 
An  example  of  a  non-straight-line  merging  algorithm  is  the  following: 

Create  a  new  sorted  list  from  two  sorted  lists  by  removing  the  smaller  item  from  the 
two  lists  and  appending  it  to  the  new  list  until  one  list  is  empty,  at  which  point  append 
the  non-empty  list  to  the  end  of  the  new  list. 
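The merging procedure just described can be sketched as follows. Which comparisons are made depends on the data, which is what makes the algorithm non-straight-line. The name `merge_sorted` is hypothetical, and ascending order is assumed.

```python
def merge_sorted(xs, ys):
    """Merge two ascending lists by repeatedly removing the smaller front item."""
    merged = []
    i = j = 0
    while i < len(xs) and j < len(ys):   # stop as soon as one list is empty
        if xs[i] <= ys[j]:
            merged.append(xs[i])
            i += 1
        else:
            merged.append(ys[j])
            j += 1
    merged.extend(xs[i:])                # append whatever remains of the
    merged.extend(ys[j:])                # non-empty list (one of these is empty)
    return merged
```

For example, `merge_sorted([1, 4, 9], [2, 3, 10, 11])` returns `[1, 2, 3, 4, 9, 10, 11]`.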

The binary sorting function $f_{sort}^{(n)} : \mathcal{B}^n \mapsto \mathcal{B}^n$ described in Section 2.11 sorts a Boolean
n-tuple into descending order. The combinational circuit given there is an example of a
straight-line sorting network, a network realized by a straight-line program. When the set of elements
to be sorted is not Boolean, sorting networks can become quite a bit more complicated, as we
see below.

In this section we describe sorting networks, circuits constructed from comparator operators
that take n elements drawn from a finite totally ordered set A and put them into sorted
order. A comparator function $\otimes : A^2 \mapsto A^2$ with arguments a and b returns their maximum
and minimum; that is, $\otimes(a, b) = (\max(a, b), \min(a, b))$.

It  is  convenient  to  show  a  comparator  operator  as  a  vertical  edge  between  two  lines  carrying 
values,  as  in  Fig.  6.10(a).  The  values  on  the  two  lines  to  the  right  of  the  edge  are  the  values  to 
its  left  in  sorted  order,  the  smaller  being  on  the  upper  line.  A  sorting  network  is  an  example 
of  a  comparator  network,  a  circuit  in  which  the  only  operator  is  a  comparator.  Input  values 
appear  on  the  left  and  output  values  appear  on  the  right  in  sorted  order. 

Shown in Fig. 6.10(b) is an insertion-sorting network on five inputs that inserts an element
into a previously sorted sublist. Two inputs are sorted at the wavefront labeled A. Between
wavefronts A and B a new item is inserted that is compared against the previously sorted sublist
and inserted into its proper position. The same occurs between wavefronts B and C and after
wavefront C. An insertion-sorting network can be realized with one comparator for the first
two inputs and $k - 1$ more for the kth input, $3 \leq k \leq n$. Let $C_{insert}(n)$ and $D_{insert}(n)$
denote the size and depth of an insertion-sorting network on n elements. Then $C_{insert}(2) = 1$ and
$D_{insert}(2) = 1$, and

$$C_{insert}(n) = C_{insert}(n-1) + n - 1 = n(n-1)/2$$
$$D_{insert}(n) = D_{insert}(n-1) + 2 = 2n - 3$$

The depth bound follows because the longest path passes through the first comparator of each
of the first $n - 2$ wavefronts and then through the chain of $n - 1$ comparators added at the last
wavefront, so that each new wavefront extends the longest path by two comparators. A simple
proof by induction establishes these results.

Figure 6.10 (a) A comparison operator with outputs min(a, b) and max(a, b), and (b) an insertion-sorting network.
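An insertion-sorting network can be generated and measured with a short program. The sketch below is one possible wiring, assumed for illustration: wires are numbered from 0, each comparator (i, j) places the smaller value on wire i, and the kth wavefront is a chain of adjacent comparators. It counts the comparators and measures the longest path. The names `insertion_network`, `apply_network`, and `network_depth` are hypothetical.

```python
def insertion_network(n):
    """Comparators (i, j) with i < j; each places the smaller value on wire i."""
    comps = []
    for k in range(1, n):              # wavefront that inserts the value on wire k
        for i in range(k, 0, -1):      # chain of k comparators moving it into place
            comps.append((i - 1, i))
    return comps


def apply_network(comps, values):
    """Run a comparator network on a list of values and return the wire contents."""
    v = list(values)
    for i, j in comps:
        if v[i] > v[j]:
            v[i], v[j] = v[j], v[i]
    return v


def network_depth(comps, n):
    """Longest chain of comparators; comps must be listed in a feasible order."""
    ready = [0] * n                    # depth at which each wire's value is available
    for i, j in comps:
        ready[i] = ready[j] = max(ready[i], ready[j]) + 1
    return max(ready, default=0)
```

For five inputs, `len(insertion_network(5))` is 10, and `apply_network(insertion_network(5), [4, 1, 5, 2, 3])` returns `[1, 2, 3, 4, 5]`.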

6.8. 1  Sorting  Via  Bitonic  Merging 

We now describe Batcher's bitonic merging network, which is the basis for a sorting
network. Let $x = (x_1, x_2, \ldots, x_m)$ and $y = (y_1, y_2, \ldots, y_m)$ be ordered sequences of
length m. That is, $x_j \leq x_{j+1}$ and $y_j \leq y_{j+1}$. As suggested in Fig. 6.11, the even-indexed
components of x are merged with the odd-indexed components of y, as are the odd-indexed
components of x and the even-indexed components of y. Each of the four lists that are merged
is itself sorted. The two merged lists are interleaved and the kth and (k+1)st elements, k even,
are compared and swapped if necessary. To prove correctness of this circuit, we use the zero-one
principle, which is stated below for sorting networks but applied later to merging networks.

THEOREM 6.8.1 (Zero-one principle) If a comparator network for inputs over a set A correctly
sorts all binary inputs, it correctly sorts all inputs.

Proof The proof is by contradiction. Suppose the network correctly sorts all 0-1 sequences
but fails to sort the input sequence $(a_1, a_2, \ldots, a_n)$. Then there are inputs $a_i$ and $a_j$ such
that $a_i < a_j$ but the network puts $a_j$ before $a_i$.

Since a sorting network contains only comparators, if we replace each entry $a_r$ in an
input sequence $(a_1, a_2, \ldots, a_n)$ with a new entry $h(a_r)$, where h(a) is monotonically
non-decreasing in a (h(a) is non-decreasing as a increases), each comparison of entries
$a_r$ and $a_s$ is replaced by a comparison of entries $h(a_r)$ and $h(a_s)$. Since $a_r \leq a_s$ only
if $h(a_r) \leq h(a_s)$, the set of comparisons made by the sorting network will be exactly
the same on $(a_1, a_2, \ldots, a_n)$ as on $(h(a_1), h(a_2), \ldots, h(a_n))$. Thus, the original output
$(b_1, b_2, \ldots, b_n)$ will be replaced by the output sequence $(h(b_1), h(b_2), \ldots, h(b_n))$.

Figure 6.11 A recursive construction of the bitonic merging network BM(4). The even-indexed
elements of one sorted sequence are merged with the odd-indexed elements of the other,
the resulting sequences interleaved, and the even- and succeeding odd-indexed elements compared.
The inputs of one sequence are permuted to demonstrate that BM(4) uses two copies of
BM(2).

Since it is presumed that the comparator network puts $a_i$ and $a_j$ in the incorrect order,
let h(x) be the following monotone function:

$$h(x) = \begin{cases} 0 & \text{if } x \leq a_i \\ 1 & \text{if } x > a_i \end{cases}$$

Then the input and output sequences to the comparator network are binary. However,
the output sequence is not sorted ($a_j$ appears before $a_i$ but $h(a_j) = 1$ and $h(a_i) = 0$),
contradicting the hypothesis of the theorem. It follows that all sequences over A must be
sorted correctly. ■
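The central step of the proof is that a monotone map h commutes with a comparator network: applying h to the inputs gives the same result as applying h to the outputs. The sketch below checks this for a threshold function and tests a network on all binary inputs. It assumes comparators are pairs (i, j) that place the smaller value on wire i; the names `apply_network`, `threshold`, `sorts_all_binary`, and `sorts_all_permutations` are hypothetical.

```python
from itertools import permutations, product


def apply_network(comps, values):
    """Run a comparator network; comparator (i, j) puts the smaller value on wire i."""
    v = list(values)
    for i, j in comps:
        if v[i] > v[j]:
            v[i], v[j] = v[j], v[i]
    return v


def threshold(t):
    """Monotone 0-1 map h with h(x) = 0 if x <= t and h(x) = 1 otherwise."""
    return lambda x: 0 if x <= t else 1


def sorts_all_binary(comps, n):
    """True if the network sorts every one of the 2^n binary inputs."""
    return all(apply_network(comps, bits) == sorted(bits)
               for bits in product([0, 1], repeat=n))


def sorts_all_permutations(comps, n):
    """True if the network sorts every permutation of 0, 1, ..., n - 1."""
    return all(apply_network(comps, p) == list(range(n))
               for p in permutations(range(n)))
```

For the four-input network `[(0, 1), (2, 3), (0, 2), (1, 3), (1, 2)]` both tests succeed. If the last comparator is dropped, the binary input (0, 1, 0, 1) is left unsorted, and so is the permutation (0, 2, 1, 3).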


We  now  show  that  Batcher’s  bitonic  merging  circuit  correctly  merges  two  sorted  lists.  If 
a  correct  m-sorter  exists,  then  a  correct  2m-sorter  can  be  constructed  by  combining  two  m- 
sorters  with  a  correct  2m-input  bitonic  merging  circuit.  It  follows  that  a  correct  2m-input 
bitonic  merging  circuit  exists  if  and  only  if  the  resulting  sorting  network  is  correct.  This  is 
the  core  idea  in  a  proof  by  induction  of  correctness  of  the  2m-input  bitonic  merging  circuit. 
The  basis  for  induction  is  the  fact  that  individual  comparators  correctly  sort  sequences  of  two 
elements. 

Suppose that x and y are sorted 0-1 sequences of length m. Let x have k 0's and
$m - k$ 1's, and let y have l 0's and $m - l$ 1's. Then the leftmost merging network of Fig. 6.11
selects exactly $\lceil k/2 \rceil$ 0's from x and $\lfloor l/2 \rfloor$ 0's from y to produce the sequence u consisting of
$a = \lceil k/2 \rceil + \lfloor l/2 \rfloor$ 0's followed by 1's. Similarly, the rightmost merging network produces
the sequence v consisting of $b = \lfloor k/2 \rfloor + \lceil l/2 \rceil$ 0's followed by 1's. Since $\lceil x \rceil - \lfloor x \rfloor$ is 0 or
1, it follows that either $a = b$, $a = b - 1$, or $a = b + 1$. Thus, when u and v are interleaved
to produce the sequence z it contains a sequence of $a + b$ 0's followed by 1's when $a = b$ or
$a = b + 1$, or 2a 0's followed by 1, 0 followed by 1's when $a = b - 1$, as suggested below:

$$z = \underbrace{0, \ldots, 0}_{2a}, 1, 0, 1, \ldots, 1$$

Thus, if for each $0 \leq k \leq m - 1$ the outputs in positions 2k and 2k + 1 are compared and
swapped, if necessary, the output will be properly sorted.
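The recursive merging construction analyzed above can be sketched as a function on lists. This is an illustration under stated assumptions: both inputs are ascending lists of the same power-of-two length, positions in the text are 1-indexed while Python slices are 0-indexed, and a single comparator serves as the base case. The name `batcher_merge` is hypothetical.

```python
def batcher_merge(x, y):
    """Merge two ascending lists of equal power-of-two length m with comparators only."""
    m = len(x)
    if m == 1:
        return [min(x[0], y[0]), max(x[0], y[0])]   # a single comparator
    # Odd-indexed components (1-indexed) are x[0::2]; even-indexed ones are x[1::2].
    u = batcher_merge(x[0::2], y[1::2])   # odd components of x with even components of y
    v = batcher_merge(x[1::2], y[0::2])   # even components of x with odd components of y
    z = [None] * (2 * m)
    z[0::2] = u                           # interleave u and v
    z[1::2] = v
    for k in range(m):                    # compare positions 2k and 2k + 1
        if z[2 * k] > z[2 * k + 1]:
            z[2 * k], z[2 * k + 1] = z[2 * k + 1], z[2 * k]
    return z
```

For example, `batcher_merge([1, 4, 6, 7], [2, 3, 5, 8])` returns `[1, 2, 3, 4, 5, 6, 7, 8]`. The positions compared do not depend on the data, so the function describes a comparator network.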

The graph of BM(4) of Fig. 6.11 illustrates that BM(4) is constructed of two copies of
BM(2). In addition, it demonstrates that the operations of each of the two BM(2) subnetworks
can be performed in parallel. Another important observation is that this graph is isomorphic
to an FFT graph when the comparators are replaced by two-input butterfly graphs,
as shown in Fig. 6.12.

THEOREM 6.8.2 Batcher's 2n-input bitonic merging circuit BM(n) for merging two sorted n-sequences,
$n = 2^k$, has the following size and depth bounds over the basis $\Omega$ of comparators:

$$C_\Omega(BM(n)) \leq n(\log n + 1)$$

$$D_\Omega(BM(n)) \leq \log n + 1$$

Proof Let C(k) and D(k) be the size and depth of BM(n). Then C(0) = 1, D(0) = 1,
$C(k) = 2C(k - 1) + 2^k$, and $D(k) = D(k - 1) + 1$. It follows that $C(k) = (k + 1)2^k$
and $D(k) = k + 1$. (See Problem 6.29.) ■

This leads us to the recursive construction of Batcher's bitonic sorting network BS(n)
for sequences of length n, $n = 2^k$. It merges the output of two copies of BS(n/2) using
a copy of Batcher's n-input bitonic merging circuit BM(n/2). The proof of the following
theorem is left as an exercise. (See Problem 6.28.)

THEOREM 6.8.3 Batcher's n-input bitonic sorting circuit BS(n) for $n = 2^k$ has the following
size and depth bounds over the basis $\Omega$ of comparators:

$$C_\Omega(BS(n)) = \frac{n}{4}\left[\log^2 n + \log n\right]$$
$$D_\Omega(BS(n)) = \frac{1}{2} \log n\,(\log n + 1)$$

Figure 6.12 The graph resulting from the replacement of comparators in Fig. 6.11 with two-input
butterfly graphs and the permutation of inputs. All edges are directed from left to right.
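The recursive construction of BS(n) can be turned into a program that emits the network as a list of comparators, so that its size and depth can be measured directly. The sketch below makes several assumptions for illustration: wires are numbered from 0, a comparator (a, b) places the smaller value on wire a, and each routine returns the wires in the order that holds the sorted output. The names `merge_net`, `sort_net`, `run`, and `depth` are hypothetical.

```python
def merge_net(xs, ys):
    """Comparators that merge sorted wire lists xs and ys (equal power-of-two length)."""
    if len(xs) == 1:
        return [(xs[0], ys[0])], [xs[0], ys[0]]
    cu, u = merge_net(xs[0::2], ys[1::2])   # odd components of x, even components of y
    cv, v = merge_net(xs[1::2], ys[0::2])   # even components of x, odd components of y
    comps, out = cu + cv, []
    for a, b in zip(u, v):                  # interleave, then compare adjacent pairs
        comps.append((a, b))
        out += [a, b]
    return comps, out


def sort_net(wires):
    """Comparators of BS(n) on the given wires and the wire order of the sorted output."""
    if len(wires) == 1:
        return [], list(wires)
    half = len(wires) // 2
    c1, o1 = sort_net(wires[:half])
    c2, o2 = sort_net(wires[half:])
    cm, out = merge_net(o1, o2)
    return c1 + c2 + cm, out


def run(comps, out, values):
    """Apply the comparators and read the wires in output order."""
    v = list(values)
    for a, b in comps:
        if v[a] > v[b]:
            v[a], v[b] = v[b], v[a]
    return [v[w] for w in out]


def depth(comps, n):
    """Longest chain of comparators; comps must be listed in a feasible order."""
    ready = [0] * n
    for a, b in comps:
        ready[a] = ready[b] = max(ready[a], ready[b]) + 1
    return max(ready)
```

With `comps, out = sort_net(list(range(8)))`, the network has `len(comps) == 24` comparators, and `run(comps, out, values)` returns the eight values in ascending order.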

6.8.2  Fast  Sorting  Networks 

Ajtai, Komlós, and Szemerédi [14] have shown the existence of a sorting network (known
as the AKS sorting network) on n inputs whose circuit size and depth are O(n log n) and
O(log n), respectively. The question had been open for many years whether such a sorting
network existed. Prior to [14] it was thought that sorting networks required $\Omega(\log^2 n)$ depth.


Problems 

MATHEMATICAL  PRELIMINARIES 

6.1 Show that $(\mathbb{Z}, +, *, 0, 1)$ is a commutative ring, where + and * denote integer addition
and multiplication and 0 and 1 denote the first two integers.

6.2 Let $\mathbb{Z}_p$ be the set of integers modulo p, p > 0, under addition and multiplication
modulo p with additive identity 0 and multiplicative identity 1. Show that $\mathbb{Z}_p$ is a ring.

6.3 A field $\mathcal{F}$ is a commutative ring in which each element other than 0 has a multiplicative
inverse. Show that $(\mathbb{Z}_p, +, *, 0, 1)$ is a field when p is a prime.

MATRICES 

6.4 Let $M_{n \times n}$ be the set of $n \times n$ matrices over a ring $\mathcal{R}$. Show that $(M_{n \times n}, +_n, \times_n, 0_n, I_n)$
is a ring, where $+_n$ and $\times_n$ are the matrix addition and multiplication operators
and $0_n$ and $I_n$ are the $n \times n$ zero and identity matrices.

6.5 Show that the maximum number of linearly independent rows and of linearly independent
columns of an $n \times m$ matrix A over a field are the same.

Hint:  Use  the  fact  that  permuting  the  rows  and/or  columns  of  A  and  adding  a  scalar 
product  of  one  row  (column)  of  A  to  any  other  row  (column)  does  not  change  its  rank. 
Use  row  and  column  permutations  as  well  as  additions  of  scalar  products  to  rows  and/or 
columns  of  A  to  transform  A  into  a  matrix  that  contains  the  largest  possible  identity 
matrix  in  its  upper  left-hand  corner.  This  is  called  Gaussian  elimination. 

6.6 Show that $(AB)^T = B^T A^T$ for all $m \times n$ matrices A and $n \times p$ matrices B over a
commutative ring $\mathcal{R}$.

MATRIX  MULTIPLICATION 

6.7 The standard matrix-vector multiplication algorithm for a general $n \times n$ matrix requires
$O(n^2)$ operations. Show that at most $O(n^{\log_2 3})$ operations are needed when the matrix
is Toeplitz.

Hint: Assume that n is a power of two and treat the matrix as a $2 \times 2$ matrix of
$n/2 \times n/2$ matrices. Also note that only $2n - 1$ values determine all the entries in a
Toeplitz matrix. Thus, the difference between two $n \times n$ Toeplitz matrices does not
require $n^2$ operations.

6.8 Generalize Strassen's matrix multiplication algorithm to matrices that are $m \times m$ for
$m = p2^k$, p and k both integers. Derive bounds on the size and depth of a circuit
realizing this version of the algorithm.

For arbitrary n, show how $n \times n$ matrices can be embedded into $m \times m$ matrices,
$m = p2^k$, so that this new version of the algorithm can be used. Show that upper
bounds of $4.77 n^{\log_2 7}$ and O(log n) on the size and depth of this algorithm can be
obtained.

6.9  Show  that  Strassen’s  matrix  multiplication  algorithm  can  be  used  to  multiply  square 
Boolean  matrices  by  replacing  OR  by  addition  modulo  n  +  1.  Derive  a  bound  on  the 
size  and  depth  of  a  circuit  to  realize  this  algorithm. 

6.10 Show that, when one of two $n \times n$ Boolean matrices A and B is fixed and known in
advance, A and B can be multiplied by a circuit with $O(n^3/\log n)$ operations and
depth O(log n) to produce the product C = AB using the information provided
below.

a) Multiplication of A and B is equivalent to n multiplications of A with an $n \times 1$
vector x, a column of B.

b) Since A is a 0-1 matrix, the product Ax consists of sums of variables in x.

c) The product Ax can be further decomposed into the sum $A_1x_1 + A_2x_2 + \cdots + A_kx_k$
where $k = \lceil n/\lceil \log n \rceil \rceil$, $A_j$ is the $n \times \lceil \log n \rceil$ submatrix consisting of
columns $(j - 1)\lceil \log n \rceil + 1$ through $j\lceil \log n \rceil$ of A, and $x_j$ is the jth set of
$\lceil \log n \rceil$ rows (variables) in x.

d) There are at most n distinct sums of $\lceil \log n \rceil$ variables each of which can be formed
in at most 2n addition steps, thereby saving a factor of $\lceil \log n \rceil$.

TRANSITIVE  CLOSURE 

6.11 Let $A = [a_{i,j}]$, $1 \leq i, j \leq n$, be a Boolean matrix that is the adjacency matrix of
a directed graph G = (V, E) on n = |V| vertices. Give a proof by induction that
the entry in the rth row and sth column of $A^k = A^{k-1} \times A$ is 1 if there is a path
containing k edges from vertex r to vertex s and 0 otherwise.

6.12 Consider a directed graph G = (V, E) in which each edge carries a label drawn from
a semiring. Let the entry in the ith row and jth column of the adjacency matrix of G
contain the label of the edge between vertices i and j if there is such an edge and the
empty set otherwise. Show that Theorems 6.4.1, 6.4.2, and 6.4.3 hold for such labeled graphs.

MATRIX  INVERSION 

6.13  Show  that  over  fields  the  following  properties  hold  for  matrix  inversion: 

a) The inverse of a $2 \times 2$ upper (lower) triangular matrix is the matrix with the
off-diagonal term negated.

b) The inverse of a $2 \times 2$ diagonal matrix is a diagonal matrix in which the ith diagonal
element is the multiplicative inverse of the ith diagonal element of the original
matrix.

6.14 Show that a lower triangular Toeplitz matrix T can be inverted by a circuit of size
O(n log n) and depth $O(\log^2 n)$.

Hint: Assume that $n = 2^k$, write T as a $2 \times 2$ matrix of $n/2 \times n/2$ matrices, and
devise a recursive algorithm to invert T.

6.15 Exhibit a circuit to compute the characteristic polynomial $\phi_A(x)$ of an $n \times n$ matrix
A over a field $\mathcal{R}$ that has $O(\max(n^3, \sqrt{n}\, M_{matrix}(n)))$ field operations and depth
$O(\log^2 n)$.

Hint: Consider the case $n = k^2$. Represent the integer i, $0 \leq i \leq n - 1$, by the unique
pair of integers (r, s), $0 \leq r, s \leq k - 1$, where i = rk + s. Represent the coefficient
$c_{i+1}$, $0 \leq i \leq n - 2$, of $\phi_A(x)$ by $c_{r,s}$. Then we can write $\phi_A(x)$ as follows:

$$\phi_A(x) = \sum_{r=0}^{k-1} A^{rk} \left( \sum_{s=0}^{k-1} c_{r,s} A^s \right)$$

Show that it suffices to perform $k^2n^2 = n^3$ scalar multiplications and $k(k-1)n^2 \leq n^3$
additions to form the inner sums, k multiplications of $n \times n$ matrices, and $kn^2$ scalar
additions to combine these products. In addition, $A^2, A^3, \ldots, A^{k-1}$ and $A^k, A^{2k}, \ldots, A^{(k-1)k}$
must be computed.

6.16 Show that the traces of powers, $s_r$, $1 \leq r \leq n$, for an $n \times n$ matrix A over a field can
be computed with $O(\sqrt{n}\, M_{matrix}(n))$ operations.

Hint: By definition $s_r = \sum_{j=1}^{n} a_{j,j}^{(r)}$, where $a_{j,j}^{(r)}$ is the jth diagonal term of the matrix
$A^r$. Let n be a square. Represent r uniquely by a pair (a, b), where $1 \leq a, b \leq \sqrt{n} - 1$
and $r = a\sqrt{n} + b$. Then $A^r = A^{a\sqrt{n}} A^b$. Thus, $a_{j,j}^{(r)}$ can be computed as the product
of the jth row of $A^{a\sqrt{n}}$ with the jth column of $A^b$. Then, for each j, $1 \leq j \leq n$,
form the $\sqrt{n} \times n$ matrix $R_j$ whose ath row is the jth row of $A^{a\sqrt{n}}$, $0 \leq a \leq \sqrt{n} - 1$.
Also form the $n \times \sqrt{n}$ matrix $C_j$ whose bth column is the jth column of $A^b$, $1 \leq b \leq \sqrt{n} - 1$.
Show that the product $R_jC_j$ contains each of the terms $a_{j,j}^{(r)}$ for all values
of r, $0 \leq r \leq n - 1$, and that the products $R_jC_j$, $1 \leq j \leq n$, can be computed
efficiently.

6.17 Show that (6.14) holds by applying the properties of the coefficients of the characteristic
polynomial of an $n \times n$ matrix stated in (6.15).

Hint: Use proof by induction on l to establish (6.14).


CONVOLUTION 

6.18 Consider the convolution $f_{conv}^{(n,m)} : R^{n+m} \mapsto R^{n+m-1}$ of an n-tuple a and an m-tuple
b when $n \ll m$. Develop a circuit for this problem whose size is O(m log n)
that uses the convolution theorem multiple times.

Hint: Represent the m-tuple b as a sequence of $\lceil m/n \rceil$ n-tuples.

6.19 The wrapped convolution $f_{wrapped}^{(n)} : \mathcal{R}^{2n} \mapsto \mathcal{R}^n$ maps n-tuples $a = (a_0, a_1, \ldots, a_{n-1})$
and $b = (b_0, b_1, \ldots, b_{n-1})$, denoted $a \star b$, to the n-tuple c the jth component
of which, $c_j$, is defined as follows:

$$c_j = \sum_{r+s = j \bmod n} a_r * b_s$$

Show that the wrapped convolution on n-tuples contains the standard convolution on
$\lfloor (n+1)/2 \rfloor$-tuples as a subfunction and vice versa.

Hint: In both halves of the problem, it helps to characterize the standard and wrapped
convolutions as matrix-vector products. It is straightforward to show that the wrapped
convolution contains the standard convolution as a subfunction. To show the other result,
observe that the matrix characterizing the standard convolution contains a Toeplitz
matrix as a submatrix. Consider, for example, the standard convolution of two six-tuples.
The matrix associated with the wrapped convolution contains a special type of
Toeplitz matrix.

6.20 Show that the standard convolution function $f_{conv}^{(n,n)} : R^{2n} \mapsto R^{2n-1}$ is a subfunction
of the integer multiplication function, $f_{mult} : \mathcal{B}^{2n\lceil \log n \rceil} \mapsto \mathcal{B}^{2n\lceil \log n \rceil}$ of Section 2.9
when R is the ring of integers modulo 2.

Hint: Represent the two sequences to be convolved as binary numbers that have been
padded with zeros so that at most one bit in a sequence appears among $\lceil \log n \rceil$ positions.

DISCRETE  FOURIER  TRANSFORM 

6.21 Let $n = 2^k$. Use proof by induction to show that for all elements a of a commutative
ring $\mathcal{R}$ the following identity holds, where $\prod$ is the product operation:

$$\sum_{j=0}^{n-1} a^j = \prod_{j=0}^{k-1} \left(1 + a^{2^j}\right)$$

6.22 Let $n = 2^k$ and let $\mathcal{R}$ be a commutative ring. For $\omega \in \mathcal{R}$, $\omega \neq 0$, let $m = \omega^{n/2} + 1$.
Show that for $1 \leq p < n$

$$\sum_{j=0}^{n-1} \omega^{pj} = 0 \bmod m$$

Hint: Represent p as the product of the largest power of 2 with an odd integer and
apply the result of Problem 6.21.

6.23 Let n and $\omega$ be positive powers of two. Let $m = \omega^{n/2} + 1$. Show that in the ring $\mathbb{Z}_m$
of integers modulo m the integer n has a multiplicative inverse and that $\omega$ is a principal
nth root of unity.

6.24 Let n be even. Use the results of Problems 6.21, 6.22, and 6.23 to show that $\mathbb{Z}_m$,
the set of integers modulo m, $m = 2^{tn/2} + 1$ for any positive integer $t \geq 2$, is a
commutative ring in which $\omega = 2^t$ is a principal nth root of unity.

6.25 Let $\omega$ be a principal nth root of unity of the commutative ring $\mathcal{R} = (R, +, *, 0, 1)$.
Show that $\omega^2$ is a principal (n/2)th root of unity.

6.26 A circulant is an $n \times n$ matrix in which the rth row is the rth cyclic shift of the first
row, $2 \leq r \leq n$. When n is a prime, show that computing the DFT of a vector of
length n is equivalent to multiplying by an $(n - 1) \times (n - 1)$ circulant.

6.27 Show that the multiplication of a circulant matrix with a vector can be done by a circuit
of size O(n log n) and depth O(log n).

Figure 6.13 A bitonic sorter on seven inputs $x_1, \ldots, x_7$.


MERGING  AND  SORTING 

6.28  Prove  Theorem  6.8.3. 

6.29 Show that the recurrences given below and stated in the proof of Theorem 6.8.2 have
the solutions shown, where C(0) = 1 and D(0) = 1:

$$C(k) = 2C(k - 1) + 2^k = (k + 1)2^k$$
$$D(k) = D(k - 1) + 1 = k + 1$$

6.30 A sequence $(x_1, x_2, \ldots, x_n)$ is bitonic if there is an integer $0 \leq k \leq n$ such that
$x_1 \geq \ldots \geq x_k \leq \ldots \leq x_n$.

a) Show that a bitonic sorting network can be constructed as follows: i) sort $(x_1, x_3, x_5, \ldots)$
and $(x_2, x_4, x_6, \ldots)$ in bitonic sorters whose lines are interleaved, ii)
compare and interchange the outputs in pairs, beginning with the least significant
pairs. (See Fig. 6.13.)

b) Show that two ordered lists can be merged with a bitonic sorter and that an n-sorter
can be constructed from bitonic sorters.

c) Determine the number of comparators in a $2^k$-sorter based on merging with bitonic
sorters.

Chapter  Notes 

The  bulk  of  this  chapter  concerns  matrix  computations,  a  topic  with  a  long  history.  Many 
books  have  been  written  on  this  subject  to  which  the  interested  reader  may  refer.  (See  [25], 
[44],  [105],  [198],  and  [362].) 

Among the more important recent results in this area is the matrix multiplication algorithm
of Strassen [319]. Many other improvements have been made on this work, among the
most significant of which is the demonstration by Coppersmith and Winograd [81] that two
$n \times n$ matrices can be multiplied with $O(n^{2.376})$ ring operations.

The relationships between transitive closure and matrix multiplication embodied in Theorems
6.4.2 and 6.4.3 as well as the generalization of these results to closed semirings are taken
from the book by Aho, Hopcroft, and Ullman [10].

Winograd [364] demonstrated that matrix multiplication is no harder than matrix inversion,
whereas Aho, Hopcroft, and Ullman [10] demonstrated the converse.

Csanky's algorithm for matrix inversion is reported in [82]. Leverrier's method for computing
the characteristic polynomial of a matrix is described in [98].

Although the FFT algorithm became well known through the work of Cooley and Tukey
[80], the idea actually begins with Gauss in 1805! (See Heideman, Johnson, and Burrus [130].)

The zero-one principle for the study of comparator networks is due to Knuth [170]. Oddly
enough, Batcher's odd-even merging network is due to Batcher [29].

Borodin  and  Munro  [56]  is  a  good  early  source  for  arithmetic  complexity,  the  size  and 
depth  of  arithmetic  circuits  for  problems  related  to  matrices  and  polynomials.  More  recent 
work on the parallel evaluation of arithmetic circuits is surveyed by JáJá [148, Chapter 8] and
von zur Gathen [111].

Parallel Computation


Parallelism  takes  many  forms  and  appears  in  many  guises.  It  is  exhibited  at  the  CPU  level  when 
microinstructions  are  executed  simultaneously.  It  is  also  present  when  an  arithmetic  or  logic 
operation  is  realized  by  a  circuit  of  small  depth,  as  with  carry-save  addition.  And  it  is  present 
when  multiple  computers  are  connected  together  in  a  network.  Parallelism  can  be  available  but 
go  unused,  either  because  an  application  was  not  designed  to  exploit  parallelism  or  because  a 
problem  is  inherently  serial. 

In this chapter we examine a number of explicitly parallel models of computation, including
shared and distributed memory models and, in particular, linear and multidimensional
arrays,  hypercube-based  machines,  and  the  PRAM  model.  We  give  a  broad  introduction  to 
a  large  and  representative  set  of  models,  describing  a  handful  of  good  parallel  programming 
techniques  and  showing  through  analysis  the  limits  on  parallelization.  Because  of  the  limited 
use  so  far  of  parallel  algorithms  and  machines,  the  wide  range  of  hardware  and  software  models 
developed  by  the  research  community  has  not  yet  been  fully  digested  by  the  computer  industry. 

Parallelism in logic and algebraic circuits is also examined in Chapters 2 and 6. The block
I/O model, which characterizes parallelism at the disk level, is presented in Section 11.6, and
the classification of problems by their execution time on parallel machines is discussed in
Section 8.15.2.


7.1 Parallel Computational Models

A parallel computer is any computer that can perform more than one operation at a time.
By  this  definition  almost  every  computer  is  a  parallel  computer.  For  example,  in  the  pursuit 
of  speed,  computer  architects  regularly  perform  multiple  operations  in  each  CPU  cycle:  they 
execute  several  microinstructions  per  cycle  and  overlap  input  and  output  operations  (I/O)  (see 
Chapter  11)  with  arithmetic  and  logical  operations.  Architects  also  design  parallel  computers 
that  are  either  several  CPU  and  memory  units  attached  to  a  common  bus  or  a  collection  of 
computers  connected  together  via  a  network.  Clearly  parallelism  is  common  in  computer 
science  today. 

However,  several  decades  of  research  have  shown  that  exploiting  large-scale  parallelism  is 
very  hard.  Standard  algorithmic  techniques  and  their  corresponding  data  structures  do  not 
parallelize  well,  necessitating  the  development  of  new  methods.  In  addition,  when  parallelism 
is  sought  through  the  undisciplined  coordination  of  a  large  number  of  tasks,  the  sheer  number 
of  simultaneous  activities  to  which  one  human  mind  must  attend  can  be  so  large  that  it  is 
often difficult to ensure correctness of a program design. The problems of parallelism are
indeed  daunting. 

Small illustrations of this point are seen in Section 2.7.1, which presents an $O(\log n)$-step,
$O(n)$-gate addition circuit that is considerably more complex than the ripple adder given in
Section  2.7.  Similarly,  the  fast  matrix  inversion  straight-line  algorithm  of  Section  6.5.5  is  more 
complex  than  other  such  algorithms  (see  Section  6.5). 

In this chapter we examine forms of parallelism that are more coarse-grained than is
typically found in circuits. We assume that a parallel computer consists of multiple processors
and  memories  but  that  each  processor  is  primarily  serial.  That  is,  although  a  processor  may 
realize  its  instructions  with  parallel  circuits,  it  typically  executes  only  one  or  a  small  number  of 
instructions  simultaneously.  Thus,  most  of  the  parallelism  exhibited  by  our  parallel  computer 
is  due  to  parallel  execution  by  its  processors. 

We  also  describe  a  few  programming  styles  that  encourage  a  parallel  style  of  programming 
and  offer  promise  for  user  acceptance.  Finally,  we  present  various  methods  of  analysis  that 
have  proven  useful  in  either  determining  the  parallel  time  needed  for  a  problem  or  classifying 
a  problem  according  to  its  need  for  parallel  time. 

Given the doubling of CPU speed every two or three years, one may ask whether we can’t
just wait until CPU performance catches up with demand. Unfortunately, the appetite for
speed grows faster than increases in CPU speed alone can meet. Today many problems,
especially those involving simulation of physical systems, require teraflop computers (those
performing $10^{12}$ floating-point operations per second (FLOPS)), and it is predicted that
petaflop computers (performing $10^{15}$ FLOPS) will be needed. Achieving such high levels of
performance with a handful of CPUs may require CPU performance beyond what is physically
possible at reasonable prices.


7.2  Memoryless  Parallel  Computers 

The circuit is the premier parallel memoryless computational model: input data passes through
a circuit from inputs to outputs and disappears. A circuit is described by a directed acyclic
graph in which vertices are either input or computational vertices. Input values and the
results of computations are drawn from a set associated with the circuit. (In the case of logic
circuits, these values are drawn from the set $\mathcal{B} = \{0, 1\}$ and are called Boolean.)
The function computed at a vertex is defined through functional composition with values
associated with computational and input vertices on which the vertex depends. Boolean logic
circuits are discussed at length in Chapters 2 and 9. Algebraic and combinatorial circuits are
the subject of Chapter 6. (See Fig. 7.1.)

Figure 7.1 Examples of Boolean and algebraic circuits.

A  circuit  is  a  form  of  unstructured  parallel  computer.  No  order  or  structure  is  assumed 
on  the  operations  that  are  performed.  (Of  course,  this  does  not  prevent  structure  from  being 
imposed  on  a  circuit.)  Generally  circuits  are  a  form  of  fine-grained  parallel  computer;  that 
is,  they  typically  perform  low-level  operations,  such  as  AND,  OR,  or  NOT  in  the  case  of  logic 
circuits,  or  addition  and  multiplication  in  the  case  of  algebraic  circuits.  However,  if  the  set 
of  values  on  which  circuits  operate  is  rich,  the  corresponding  operations  can  be  complex  and 
coarse-grained. 

The  dataflow  computer  is  a  parallel  computer  designed  to  simulate  a  circuit  computation. 
It  maintains  a  list  of  operations  and,  when  all  operands  of  an  operation  have  been  computed, 
places  that  operation  on  a  queue  of  runnable  jobs. 
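
The scheduling rule just described can be illustrated with a short Python sketch. It is only a
serial illustration under simplifying assumptions: the circuit is given as a dictionary, the
runnable queue is served by one worker, and the names run_dataflow and xor_ops are
hypothetical and chosen for this example.

```python
from collections import deque


def run_dataflow(ops, inputs):
    """Evaluate a circuit in dataflow style.

    ops maps a vertex name to (function, list of operand names);
    inputs maps input-vertex names to their values.
    Returns all vertex values and the order in which operations ran.
    """
    values = dict(inputs)
    # For each operation, count the operands that are not yet available.
    missing = {v: sum(1 for a in args if a not in values)
               for v, (_, args) in ops.items()}
    # Record which operations consume each value.
    consumers = {}
    for v, (_, args) in ops.items():
        for a in args:
            consumers.setdefault(a, []).append(v)
    # Operations with no missing operands go on the queue of runnable jobs.
    runnable = deque(v for v, m in missing.items() if m == 0)
    order = []
    while runnable:
        v = runnable.popleft()
        fn, args = ops[v]
        values[v] = fn(*(values[a] for a in args))
        order.append(v)
        # A newly computed value may make its consumers runnable.
        for c in consumers.get(v, []):
            missing[c] -= 1
            if missing[c] == 0:
                runnable.append(c)
    return values, order


# A small logic circuit for x XOR y = (x OR y) AND NOT (x AND y).
xor_ops = {
    "g1": (lambda a, b: a & b, ["x", "y"]),
    "g2": (lambda a, b: a | b, ["x", "y"]),
    "g3": (lambda a: 1 - a, ["g1"]),
    "out": (lambda a, b: a & b, ["g2", "g3"]),
}

values, order = run_dataflow(xor_ops, {"x": 1, "y": 0})
print(values["out"], order)
```

In a real dataflow machine several runnable operations would be dispatched at once; the
queue here only makes the readiness condition explicit.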

We now examine a variety of structured computational models, most of which are
coarse-grained and synchronous.

7.3  Parallel  Computers  with  Memory 

Many  coarse-grained,  structured  parallel  computational  models  have  been  developed.  In  this 
section  we  introduce  these  models  as  well  as  a  variety  of  performance  measures  for  parallel 
computers.

There are many ways to characterize parallel computers. A fine-grained parallel computer
is one in which the focus is on its constituent components, which themselves consist of
low-level entities such as logic gates and binary memory cells. A coarse-grained parallel
computer is one in which we ignore the low-level components of the computer and focus instead
on its functioning at a high level. A complex circuit, such as a carry-lookahead adder, whose
details are ignored is a single coarse-grained unit, whereas one whose details are studied
explicitly is fine-grained. CPUs and large memory units are generally viewed as coarse-grained.

A  parallel  computer  is  a  collection  of  interconnected  processors  (CPUs  or  memories).  The 
processors  and  the  media  used  to  connect  them  constitute  a  network.  If  the  processors  are 
in  close  physical  proximity  and  can  communicate  quickly,  we  often  say  that  they  are  tightly 
coupled and call the machine a parallel computer rather than a computer network. However,
when the processors are not in close proximity or when their operating systems require a
large  amount  of  time  to  exchange  messages,  we  say  that  they  are  loosely  coupled  and  call  the 
machine  a  computer  network. 

Unless a problem is trivially parallel, it must be possible to exchange messages between
processors. A variety of low-level mechanisms are generally available for this purpose. The use
of software for the exchange of potentially long messages is called message passing. In a tightly
coupled parallel computer, messages are prepared, sent, and received quickly relative to the
clock speed of its processors, but in a loosely coupled parallel computer, the time required for
these steps is much larger. The time $T_m$ to transmit a message from one processor to another
is generally assumed to be of the form $T_m = \alpha + l\beta$, where $l$ is the length of the
message in words, $\alpha$ (latency) is the time to set up a communication channel, and $\beta$
(bandwidth) is the time to send and receive one word. Both $\alpha$ and $\beta$ are constant
multiples of the duration of the CPU clock cycle of the processors. Thus, $\alpha + \beta$ is the
time to prepare, send, and receive a single-word message. In a tightly coupled machine $\alpha$
and $\beta$ are small, whereas in a loosely coupled machine $\alpha$ is large.
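
To see why a large latency favors few long messages over many short ones, the cost model
can be evaluated directly. The sketch below is an illustration only; the function names and
the values of alpha and beta (in clock cycles) are assumptions chosen for the example, not
measurements of any machine.

```python
def message_time(length_words, alpha, beta):
    """Time T_m = alpha + l * beta to transmit one message of l words."""
    return alpha + length_words * beta


def total_time(total_words, num_messages, alpha, beta):
    """Time to send total_words split evenly over num_messages messages."""
    words_per_message = total_words / num_messages
    return num_messages * message_time(words_per_message, alpha, beta)


# Hypothetical loosely coupled machine: large latency, small per-word cost.
ALPHA, BETA = 1000, 2

one_long = total_time(1000, 1, ALPHA, BETA)       # one 1000-word message
many_short = total_time(1000, 1000, ALPHA, BETA)  # 1000 one-word messages
print(one_long, many_short)
```

Under these assumed constants the single long message pays the latency once, while the
one-word messages pay it a thousand times.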

An important classification of parallel computers with memory is based on the degree to
which they share access to memory. A shared-memory computer is characterized by a model
in which each processor can address locations in a common memory. (See Fig. 7.2(a).) In
this model it is generally assumed that the time to make one access to the common memory
is relatively close to the time for a processor to access one of its registers. Processors in a
shared-memory computer can communicate with one another via the common memory. The
distributed-memory computer is characterized by a model in which processors can
communicate with other processors only by sending messages. (See Fig. 7.2(b).) In this model it is
generally assumed that processors also have local memories and that the time to send a message
from one processor to another can be large relative to the time to access a local memory. A third
type of computer, a cross between the first two, is the distributed shared-memory computer.
It is realized on a distributed-memory computer on which the time to process messages is large
relative to the time to access a local memory, but a layer of software gives the programmer the
illusion of a shared-memory computer. Such a model is useful when programs can be executed
primarily from local memories and only occasionally must access remote memories.

Figure 7.2 (a) A shared-memory computer; (b) a distributed-memory computer.

Parallel computers are synchronous if all processors perform operations in lockstep and
asynchronous otherwise. A synchronous parallel machine may alternate between executing
instructions and reading from local or common memory. (See the PRAM model of
Section 7.9, which is a synchronous, shared-memory model.) Although a synchronous parallel
computational model is useful in conveying concepts, in many situations, as with loosely
coupled distributed computers, it is not a realistic one. In other situations, such as in the design
of VLSI chips, it is realistic. (See, for example, the discussion of systolic arrays in Section 7.5.)

7.3.1  Flynn’s  Taxonomy 

Flynn’s taxonomy of parallel computers distinguishes between four extreme types of
parallel machine on the basis of the degree of simultaneity in their handling of instructions and
data. The single-instruction, single-data (SISD) model is a serial machine that executes one
instruction per unit time on one data item. An SISD machine is the simplest form of serial
computer. The single-instruction, multiple-data (SIMD) model is a synchronous parallel
machine in which all processors that are not idle execute the same instruction on potentially
different data. (See Fig. 7.3.) The multiple-instruction, single-data (MISD) model
describes a synchronous parallel machine that performs different computations on the same data.
While not yet practical, the MISD machine could be used to test the primality of an
integer (the single datum) by having processors divide it by independent sets of integers. The
multiple-instruction, multiple-data (MIMD) model describes a parallel machine that runs
a potentially different program on potentially different data on each processor but can send
messages among processors.

Figure 7.3 In the SIMD model the same instruction is executed on every processor that is
not idle.

The  SIMD  machine  is  generally  designed  to  have  a  single  instruction  decoder  unit  that 
controls  the  action  of  each  processor,  as  suggested  in  Fig.  7.3.  SIMD  machines  have  not  been  a 
commercial  success  because  they  require  specialized  processors  rather  than  today’s  commodity 
processors  that  benefit  from  economies  of  scale.  As  a  result,  most  parallel  machines  today  are 
MIMD.  Nonetheless,  the  SIMD  style  of  programming  remains  appealing  because  programs 
having  a  single  thread  of  control  are  much  easier  to  code  and  debug.  In  addition,  a  MIMD 
model,  the  more  common  parallel  model  in  use  today,  can  be  programmed  in  a  SIMD  style. 

While the MIMD model is often assumed to be much more powerful than the SIMD
one, we now show that the former can be converted to the latter with at most a constant
factor slowdown in execution time. Let $K$ be the maximum number of different instructions
executable by a MIMD machine and index them with integers in the set $\{1, 2, 3, \ldots, K\}$.
Slow down the computation of each machine by a factor $K$ as follows: 1) identify time intervals
of length $K$, 2) on the $k$th step of the $j$th interval, execute the $k$th instruction of a processor
if this is the instruction that it would have performed on the $j$th step of the original computation.
Otherwise, let the processor be idle by executing its NOOP instruction. This construction
executes the instructions of a MIMD computation in a SIMD fashion (all processors either
are idle or execute the instruction with the same index) with a slowdown by a factor $K$ in
execution time.
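
The construction can be made concrete with a small Python sketch. It assumes, for
illustration only, that each instruction is a function applied to a processor's single local
value, that all processors run programs of the same length, and that no messages are
exchanged; the names run_mimd and simulate_mimd_as_simd are hypothetical.

```python
def run_mimd(instruction_set, programs, data):
    """Reference MIMD run: processor i applies its own instruction sequence."""
    state = list(data)
    for i, program in enumerate(programs):
        for k in program:
            state[i] = instruction_set[k - 1](state[i])
    return state


def simulate_mimd_as_simd(instruction_set, programs, data):
    """Run the same programs with one broadcast instruction per step.

    instruction_set[k - 1] is instruction k, for k in 1..K.
    programs[i][j] is the index of the instruction that processor i
    executes on step j of the original MIMD computation.
    Returns the final local values and the number of SIMD steps used.
    """
    K = len(instruction_set)
    state = list(data)
    simd_steps = 0
    for j in range(len(programs[0])):      # the jth interval of length K
        for k in range(1, K + 1):          # the kth step of that interval
            for i, program in enumerate(programs):
                if program[j] == k:        # this processor executes k ...
                    state[i] = instruction_set[k - 1](state[i])
                # ... and every other processor idles (NOOP).
            simd_steps += 1
    return state, simd_steps


instruction_set = [lambda v: v + 1, lambda v: 2 * v, lambda v: -v]  # K = 3
programs = [[1, 2], [2, 2], [3, 1]]   # two MIMD steps on three processors
data = [1, 2, 3]

final, steps = simulate_mimd_as_simd(instruction_set, programs, data)
print(final, steps)
```

The two-step MIMD computation takes $2K = 6$ SIMD steps here, matching the factor-$K$
slowdown of the construction.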

Although  for  most  machines  this  simulation  is  impractical,  it  does  demonstrate  that  in  the 
best  case  a  SIMD  program  is  at  worst  a  constant  factor  slower  than  the  corresponding  MIMD 
program  for  the  same  problem.  It  offers  hope  that  the  much  simpler  SIMD  programming  style 
can  be  made  close  in  performance  to  the  more  difficult  MIMD  style. 


7.3.2  The  Data-Parallel  Model 

The data-parallel model captures the essential features of the SIMD style. It has a single
thread of control in which serial and parallel operations are intermixed. The parallel
operations possible typically include vector and shifting operations (see Section 2.5.1), prefix and
segmented prefix computations (see Section 2.6), and data-movement operations such as are
realized by a permutation network (see Section 7.8.1). They also include conditional vector
operations, vector operations that are performed on those vector components for which the
corresponding component of an auxiliary flag vector has value 1 (others have value 0).

Figure 7.4 shows a data-parallel program for radix sort. This program sorts $n$ $d$-bit
integers, $\{x[n], \ldots, x[1]\}$, represented in binary. The program makes $d$ passes over the
integers. On each pass the program reorders the integers, placing those whose $j$th least
significant bit (lsb) is 1 ahead of those for which it is 0. This reordering is stable; that is, the
previous ordering among integers with the same $j$th lsb is retained. After the $j$th pass, the $n$
integers are sorted according to their $j$ least significant bits, so that after $d$ passes the list is
fully sorted. The prefix function $\mathcal{P}_{+}^{(n)}$ computes the running sum of the $j$th lsb
on the $j$th pass. Thus, for $k$ such that $x[k]_j = 1$ (0), $b_k$ ($c_k$) is the number of integers
with index $k$ or higher whose $j$th lsb is 1 (0). The value of
$a_k = b_k x[k]_j + (c_k + b_1)\overline{x}[k]_j$ is $b_k$ or $c_k + b_1$, depending on whether the
$j$th lsb of $x[k]$ is 1 or 0, respectively. That is, $a_k$ is the index of the location in which the
$k$th integer is placed after the $j$th pass.

{ $x[n]_j$ is the $j$th least significant bit of the $n$th integer. }
{ After the $j$th pass, the integers are sorted by their $j$ least significant bits. }
{ Upon completion, the $k$th location contains the $k$th largest integer. }

for $j := 0$ to $d - 1$
begin
    $(b_n, \ldots, b_1) := \mathcal{P}_{+}^{(n)}(x[n]_j, \ldots, x[1]_j)$;
    { $b_k$ is the number of 1’s among $x[n]_j, \ldots, x[k]_j$. }
    { $b_1$ is the number of integers whose $j$th bit is 1. }
    $(c_n, \ldots, c_1) := \mathcal{P}_{+}^{(n)}(\overline{x}[n]_j, \ldots, \overline{x}[1]_j)$;
    { $c_k$ is the number of 0’s among $x[n]_j, \ldots, x[k]_j$. }
    $(a_n, \ldots, a_1) := (b_n x[n]_j + (c_n + b_1)\overline{x}[n]_j, \ldots, b_1 x[1]_j + (c_1 + b_1)\overline{x}[1]_j)$;
    { $a_k = b_k x[k]_j + (c_k + b_1)\overline{x}[k]_j$ is the rank of the $k$th key. }
    $(x[n + 1 - a_n], x[n + 1 - a_{n-1}], \ldots, x[n + 1 - a_1]) := (x[n], x[n - 1], \ldots, x[1])$
    { This operation permutes the integers. }
end

Figure 7.4 A data-parallel radix sorting program to sort $n$ $d$-bit binary integers that makes two
uses of the prefix function $\mathcal{P}_{+}^{(n)}$.
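
A serial Python rendering of the program of Figure 7.4 may help in following the index
arithmetic. It is a sketch under one stated convention: the Python list holds the integers in the
order $(x[n], \ldots, x[1])$, so the key with rank $a_k$ lands in list position $a_k - 1$. Each vector
operation is written as a list comprehension or a call to itertools.accumulate, standing in for
the parallel vector and prefix operations; the function name is hypothetical.

```python
from itertools import accumulate


def radix_sort_data_parallel(keys, d):
    """Sort d-bit nonnegative integers; ones are placed ahead of zeros on each pass."""
    xs = list(keys)                  # xs[0] plays the role of x[n]
    n = len(xs)
    for j in range(d):
        bits = [(v >> j) & 1 for v in xs]    # x[k]_j
        comp = [1 - bit for bit in bits]     # complement of x[k]_j
        b = list(accumulate(bits))           # running count of 1's
        c = list(accumulate(comp))           # running count of 0's
        ones = b[-1] if n else 0             # b_1: total number of 1's
        # Rank a_k is b_k for a key whose bit is 1, and c_k + b_1 otherwise.
        a = [bk * bit + (ck + ones) * cb
             for bk, ck, bit, cb in zip(b, c, bits, comp)]
        permuted = [None] * n
        for value, rank in zip(xs, a):       # the permutation step
            permuted[rank - 1] = value
        xs = permuted
    return xs


print(radix_sort_data_parallel([5, 3, 7, 0, 3, 6], 3))
```

With this list convention the result is printed largest first, because $x[n]$ is written first.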


The  data-parallel  model  is  often  implemented  using  the  single-program  multiple-data 
(SPMD)  model.  This  model  allows  copies  of  one  program  to  run  on  multiple  processors  with 
potentially  different  data  without  requiring  that  the  copies  run  in  synchrony.  It  also  allows 
the copies to synchronize themselves periodically for the transfer of data. A convenient
abstraction often used in the data-parallel model that translates nicely to the SPMD model is the
assumption  that  a  collection  of  virtual  processors  is  available,  one  per  vector  component.  An 
operating  system  then  maps  these  virtual  processors  to  physical  ones.  This  method  is  effective 
when  there  are  many  more  virtual  processors  than  real  ones  so  that  the  time  for  interprocessor 
communication  is  amortized. 

7.3.3  Networked  Computers 

A  networked  computer  consists  of  a  collection  of  processors  with  direct  connections  between 
them.  In  this  context  a  processor  is  a  CPU  with  memory  or  a  sequential  machine  designed 
to  route  messages  between  processors.  The  graph  of  a  network  has  a  vertex  associated  with 
each  processor  and  an  edge  between  two  connected  processors.  Properties  of  the  graph  of  a 
network,  such  as  its  size  (number  of  vertices),  its  diameter  (the  largest  number  of  edges  on 
the  shortest  path  between  two  vertices),  and  its  bisection  width  (the  smallest  number  of  edges 
between  a  subgraph  and  its  complement,  both  of  which  have  about  the  same  size)  characterize 
its computational performance. Since a transmission over an edge of a network introduces
delay, the diameter of a network graph is a crude measure of the worst-case time to transmit
a message between processors. Its bisection width is a measure of the amount of information
that must be transmitted in the network for processors to communicate with their neighbors.

Figure 7.5 Completely balanced (a) and unbalanced (b) trees.

A  large  variety  of  networks  have  been  investigated.  The  graph  of  a  tree  network  is  a  tree. 
Many  simple  tasks,  such  as  computing  sums  and  broadcasting  (sending  a  message  from  one 
processor  to  all  other  processors),  can  be  done  on  tree  networks.  Trees  are  also  naturally  suited 
to  many  recursive  computations  that  are  characterized  by  divide-and-conquer  strategies,  in 
which  a  problem  is  divided  into  a  number  of  like  problems  of  similar  size  to  yield  small  results 
that  can  be  combined  to  produce  a  solution  to  the  original  problem.  Trees  can  be  completely 
balanced or unbalanced. (See Fig. 7.5.) Balanced trees of fixed degree have a root and a bounded
number of edges associated with each vertex. The diameter of such trees is logarithmic in
the  number  of  vertices.  Unbalanced  trees  can  have  a  diameter  that  is  linear  in  the  number  of 
vertices. 

A  mesh  is  a  regular  graph  (see  Section  7.5)  in  which  each  vertex  has  the  same  degree  except 
possibly  for  vertices  on  its  boundary.  Meshes  are  well  suited  to  matrix  operations  and  can  be 
used  for  a  large  variety  of  other  problems  as  well.  If,  as  some  believe,  speed-of-light  limitations 
will  be  an  important  consideration  in  constructing  fast  computers  in  the  future  [43],  the  one-, 
two-,  and  three-dimensional  mesh  may  very  well  become  the  computer  organization  of  choice. 
The diameter of a mesh of dimension $d$ with $n$ vertices is proportional to $n^{1/d}$. It is not as
small as the diameter of a tree but is acceptable for tasks for which the cost of communication
can be amortized over the cost of computation.

The hypercube (see Section 7.6) is a graph that has one vertex at each corner of a
multidimensional cube. It is an important conceptual model because it has low (logarithmic)
diameter, large bisection width, and a connectivity for which it is easy to construct efficient
parallel algorithms for a large variety of problems. While the hypercube and the tree have
similar diameters, the superior connectivity of the hypercube leads to algorithms whose running
time is generally smaller than on trees. Fortunately, many hypercube-based algorithms can be
efficiently translated into algorithms for other network graphs, such as meshes.
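
The diameters quoted for trees, meshes, and hypercubes can be checked on small instances
by breadth-first search. The following Python sketch is illustrative only: the graph builders
and their names are assumptions made for this example, and the tree is taken to be a complete
binary tree.

```python
from collections import deque
from itertools import product


def diameter(adj):
    """Largest shortest-path length over all pairs, by BFS from every vertex."""
    best = 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        best = max(best, max(dist.values()))
    return best


def complete_binary_tree(height):
    """Tree on vertices 1..2**(height+1)-1; vertex v has parent v // 2."""
    n = 2 ** (height + 1) - 1
    adj = {v: [] for v in range(1, n + 1)}
    for v in range(2, n + 1):
        adj[v].append(v // 2)
        adj[v // 2].append(v)
    return adj


def mesh(side, dim):
    """dim-dimensional mesh with side**dim vertices and no wraparound edges."""
    adj = {}
    for v in product(range(side), repeat=dim):
        neighbors = []
        for axis in range(dim):
            for step in (-1, 1):
                coord = v[axis] + step
                if 0 <= coord < side:
                    neighbors.append(v[:axis] + (coord,) + v[axis + 1:])
        adj[v] = neighbors
    return adj


def hypercube(dim):
    """Vertices are dim-bit integers; neighbors differ in exactly one bit."""
    return {v: [v ^ (1 << i) for i in range(dim)] for v in range(2 ** dim)}


# Graphs of comparable size: 15, 16, 16, and 16 vertices.
print(diameter(complete_binary_tree(3)))  # tree of height 3
print(diameter(mesh(16, 1)))              # linear array
print(diameter(mesh(4, 2)))               # 4 x 4 mesh
print(diameter(hypercube(4)))             # 4-dimensional hypercube
```

The all-pairs search is quadratic in the number of vertices, so it is suitable only for small
examples such as these.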

We  demonstrate  the  utility  of  each  of  the  above  models  by  providing  algorithms  that  are 
naturally  suited  to  them.  For  example,  linear  arrays  are  good  at  performing  matrix-vector 
multiplications and sorting with bubble sort. Two-dimensional meshes are good at
matrix-matrix multiplication, and can also be used to sort in much less time than linear arrays. The
hypercube network is very good at solving a variety of problems quickly but is much more
expensive to realize than linear or two-dimensional meshes because each processor is connected
to many more other processors.

Figure 7.6 A crossbar connection network. Any two processors can be connected.


In designing parallel algorithms it is often helpful to devise an algorithm for a particular
parallel machine model, such as a hypercube, and then map the hypercube and the algorithm
with it to the model of the machine on which it will be executed. In doing this, the
question arises of how efficiently one graph can be embedded into another. This is the
graph-embedding problem. We provide an introduction to this important question by discussing
embeddings of one type of machine into another.

A connection network is a networked computer in which all vertices except for peripheral
vertices are used to route messages. The peripheral vertices are the computers that are
connected by the network. One of the simplest such networks is the crossbar network, in which
a row of processors is connected to a column of processors via a two-dimensional array of
switches. (See Fig. 7.6.) The crossbar switch with $2n$ computational processors has $n^2$ routing
vertices. The butterfly network (see Fig. 7.15) provides a connectivity similar to that of the
crossbar but with many fewer routing vertices. However, not all permutations of the inputs to
a butterfly can be mapped to its outputs. For this purpose the Beneš network (see Fig. 7.20)
is better suited. It consists of two butterfly graphs with the outputs of one graph connected to
the outputs of the second and the order of edges of the second reversed. Many other
permutation networks exist. Designers of connection networks are very concerned with the variety of
connections that can be made among computational processors, the time to make these
connections, and the number of vertices in the network for the given number of computational
processors. (See Section 7.8.)

7.4  The  Performance  of  Parallel  Algorithms 

We now examine measures of performance of parallel algorithms. Of these, computation time
is the most important. Since parallel computation time $T_p$ is a function of $p$, the number of
processors used for a computation, we seek a relationship among $p$, $T_p$, and other measures of
the complexity of a problem.

Given a $p$-processor parallel machine that executes $T_p$ steps, in the spirit of Chapter 3, we
can construct a circuit to simulate it. Its size is proportional to $pT_p$, which plays the role of
serial time $T_s$. Similarly, a single-processor RAM of the type used in a $p$-processor parallel
machine but with $p$ times as much memory can simulate an algorithm on the parallel machine
in $p$ times as many steps; it simulates each step of each of the $p$ RAM processors in succession.
This observation provides the following relationship among $p$, $T_p$, and $T_s$ when storage space
for the serial and parallel computations is comparable.

THEOREM 7.4.1 Let $T_s$ be the smallest number of steps needed on a single RAM with storage
capacity $S$, in bits, to compute a function $f$. If $f$ can be computed in $T_p$ steps on a network of
$p$ RAM processors, each with storage $S/p$, then $T_p$ satisfies the following inequality:

$pT_p \ge T_s$ (7.1)

Proof This result follows because, while the serial RAM can simulate the parallel machine
in $pT_p$ steps, it may be able to compute the function in question more quickly. ■

The speedup $S$ of a parallel $p$-processor algorithm over the best serial algorithm for a
problem is defined as $S = T_s/T_p$. We see that, with $p$ processors, a speedup of at most $p$ is
possible; that is, $S \le p$. This result can also be stated in terms of the computational work done
by serial and parallel machines, defined as the number of equivalent serial operations.
(Computational work is defined in terms of the equivalent number of gate operations in Section 3.1.2.
The two measures differ only in terms of the units in which work is measured, CPU operations in
this section and gate operations in Section 3.1.2.) The computational work $W_p$ done by an
algorithm on a $p$-processor RAM machine is $W_p = pT_p$. The above theorem says that the
minimal parallel work needed to compute a function is at least the serial work required for it,
that is, $W_p \ge W_s = T_s$. (Note that we compare the work on a serial processor to a collection
of $p$ identical processors, so that we need not take into account differences among processors.)
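
These quantities are easy to tabulate for a given algorithm. The Python sketch below is an
illustration only; the function name is hypothetical, and the ratio $T_s/W_p$ that it reports
under the name work_ratio is introduced here merely as a convenient way to express how
close the parallel work is to the serial work. The sample figures are those of summing 1024
numbers, which takes 1023 additions serially and 10 steps on a balanced tree of 512 processors.

```python
def parallel_metrics(serial_steps, parallel_steps, processors):
    """Return (speedup, work, work_ratio) for a p-processor algorithm.

    speedup    S   = T_s / T_p
    work       W_p = p * T_p
    work_ratio     = T_s / W_p  (at most 1 by inequality (7.1))
    """
    work = processors * parallel_steps
    if work < serial_steps:
        # Inequality (7.1) says p * T_p >= T_s when T_s is the best serial time.
        raise ValueError("p * T_p < T_s: serial_steps cannot be optimal")
    speedup = serial_steps / parallel_steps
    work_ratio = serial_steps / work
    return speedup, work, work_ratio


speedup, work, work_ratio = parallel_metrics(1023, 10, 512)
print(speedup, work, work_ratio)
```

In this example the speedup is large but well below $p = 512$, and the parallel work is about
five times the serial work.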

A  parallel  algorithm  is  efficient  if  the  work  that  it  does  is  close  to  the  work  done  by  the 
best  serial  algorithm.  A  parallel  algorithm  is  fast  if  it  achieves  a  nearly  maximal  speedup.  We 
leave  unspecified  just  how  close  to  optimal  a  parallel  algorithm  must  be  for  it  to  be  classified  as 
efficient  or  fast.  This  will  often  be  determined  by  context.  We  observe  that  parallel  algorithms 
may  be  useful  if  they  complete  a  task  with  acceptable  losses  in  efficiency  or  speed,  even  if  they 
are  not  optimal  by  either  measure. 

7.4.1 Amdahl’s Law

As a warning that it is not always possible with $p$ processors to obtain a speedup of $p$, we
introduce Amdahl’s Law, which provides an intuitive justification for the difficulty of parallelizing
some tasks. In Sections 3.9 and 8.9 we provide concrete information on the difficulty of
parallelizing individual problems by introducing the P-complete problems, problems that are the
hardest polynomial-time problems to parallelize.

THEOREM 7.4.2 (Amdahl’s Law) Let $f$ be the fraction of a program’s execution time on a serial
RAM that is parallelizable. Then the speedup $S$ achievable by this program on a $p$-processor RAM
machine must satisfy the following bound:

$S \le \frac{1}{(1 - f) + f/p}$

Proof Given a $T_s$-step serial computation, $fT_s/p$ is the smallest possible number of steps
on a $p$-processor machine for the parallelizable serial steps. The remaining $(1 - f)T_s$ serial
steps take at least the same number of steps on the parallel machine. Thus, the parallel time
$T_p$ satisfies $T_p \ge T_s[(1 - f) + f/p]$, from which the result follows. ■

This result shows that if a fixed fraction $f$ of a program’s serial execution time can be
parallelized, the speedup achievable with that program on a parallel machine is bounded above
by $1/(1 - f)$ as $p$ grows without limit. For example, if 90% of the time of a serial program
can be parallelized, the maximal achievable speedup is 10, regardless of the number of parallel
processors available.
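The bound is easy to tabulate. The following Python sketch evaluates the right-hand side of Theorem 7.4.2; the function name `amdahl_bound` and the sample values of $p$ are illustrative choices, not part of the text.

```python
def amdahl_bound(f: float, p: int) -> float:
    """Upper bound on speedup from Amdahl's Law: 1 / ((1 - f) + f / p)."""
    if not 0.0 <= f <= 1.0 or p < 1:
        raise ValueError("need 0 <= f <= 1 and p >= 1")
    return 1.0 / ((1.0 - f) + f / p)


# With f = 0.9 the bound approaches, but never reaches, 1 / (1 - f) = 10.
for p in (1, 10, 100, 10_000):
    print(p, round(amdahl_bound(0.9, p), 3))
```

Even with ten thousand processors the bound stays below 10, which is the limit $1/(1 - f)$ for $f = 0.9$.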

While  this  statement  seems  to  explain  the  difficulty  of  parallelizing  certain  algorithms,  it 
should  be  noted  that  programs  for  serial  and  parallel  machines  are  generally  very  different. 
Thus,  it  is  not  reasonable  to  expect  that  analysis  of  a  serial  program  should  lead  to  bounds  on 
the  running  time  of  a  parallel  program  for  the  same  problem. 

7.4.2  Brent’s  Principle 

We  now  describe  how  to  convert  the  inherent  parallelism  of  a  problem  into  an  efficient  parallel 
algorithm.  Brent’s  principle,  stated  in  Theorem  7.4.3,  provides  a  general  schema  for  exploiting 
parallelism  in  a  problem. 

THEOREM 7.4.3 Consider a computation $C$ that can be done in $t$ parallel steps when the time
to communicate between operations can be ignored. Let $m_i$ be the number of primitive operations
done on the $i$th step and let $m = \sum_{i=1}^{t} m_i$. Consider a $p$-processor machine $M$ capable
of the same primitive operations, where $p \le \max_i m_i$. If the communication time between the
operations in $C$ on $M$ can be ignored, the same computation can be performed in $T_p$ steps on $M$,
where $T_p$ satisfies the following bound:

$$T_p \le (m/p) + t$$

Proof A parallel step in which $m_i$ operations are performed can be simulated by $M$ in
$\lceil m_i/p \rceil \le (m_i/p) + 1$ steps, from which the result follows. ■
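The counting argument in the proof can be checked numerically. The sketch below is only an illustration: the function names and the example profile of operations per step are assumptions made here, not taken from the text.

```python
def brent_steps(ops_per_step, p):
    """Steps used when a parallel step with m_i operations is simulated
    in ceil(m_i / p) steps on p processors."""
    return sum((m_i + p - 1) // p for m_i in ops_per_step)


def brent_bound(ops_per_step, p):
    """Right-hand side of Theorem 7.4.3: m / p + t."""
    return sum(ops_per_step) / p + len(ops_per_step)


# Hypothetical profile: a balanced tree sum of 16 values performs
# 8, 4, 2, and 1 additions on its four levels.
levels = [8, 4, 2, 1]
print(brent_steps(levels, 3), brent_bound(levels, 3))
```

For this profile and $p = 3$ the simulation uses 7 steps, within the bound $15/3 + 4 = 9$.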

Brent’s  principle  provides  a  schema  for  realizing  the  inherent  parallelism  in  a  problem. 
However,  it  is  important  to  note  that  the  time  for  communication  between  operations  can 
be  a  serious  impediment  to  the  efficient  implementation  of  a  problem  on  a  parallel  machine. 
Often,  the  time  to  route  messages  between  operations  can  be  the  most  important  limitation 
on  exploitation  of  parallelism. 

We illustrate Brent’s principle with the problem of adding $n$ integers, $x_1, \ldots, x_n$, $n = 2^k$.
Under the assumption that at most two integers can be added in one primitive operation, we
see that the sum can be formed by performing $n/2$ additions, $n/4$ additions of these results,
etc., until the last sum is formed. Thus, $m_i = n/2^i$ for $i \le \lceil \log_2 n \rceil$. When only $p$
processors are available, we assign $\lceil n/p \rceil$ integers to $p - 1$ processors and
$n - (p - 1)\lceil n/p \rceil$ integers to the remaining processor. In $\lceil n/p \rceil$ steps, the $p$
processors each compute their local sums, leaving their results in a reserved location. In each
subsequent phase, half of the processors active in the preceding phase are active in this one.
Each active processor fetches the partial sum computed by one other processor, adds it to its
partial sum, and stores the result in a reserved place. After $O(\log p)$ phases, the sum of the
$n$ integers has been computed. This algorithm computes the sum of the $n$ integers in
$O(n/p + \log p)$ time steps. Since the maximal speedup possible is $p$, this algorithm is optimal
to within a constant multiplicative factor if $\log p \le (n/p)$ or $p \le n/\log n$.

It is important to note that the time to communicate between processors is often very
large relative to the length of a CPU cycle. Thus, the assumption that it takes zero time to
communicate between processors, the basis of Brent’s principle, holds only for tightly coupled
processors.
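The summation algorithm of this section can be simulated serially to count its time steps. This is a minimal sketch under the same idealization (communication is free); the function name `parallel_sum` and the way partial sums are paired are assumptions made for illustration.

```python
def parallel_sum(xs, p):
    """Simulate the p-processor summation; return (total, steps)."""
    n = len(xs)
    chunk = -(-n // p)                       # ceil(n / p)
    # Local phase: each processor sums its block of at most chunk integers.
    partial = [sum(xs[i * chunk:(i + 1) * chunk]) for i in range(p)]
    steps = chunk
    # Combining phases: about half of the active processors stay active.
    active = p
    while active > 1:
        half = -(-active // 2)
        for i in range(active - half):
            partial[i] += partial[i + half]  # fetch one other partial sum
        active = half
        steps += 1
    return partial[0], steps


print(parallel_sum(list(range(1, 17)), 4))
```

With $n = 16$ and $p = 4$ the local phase takes $\lceil n/p \rceil = 4$ steps and the combining phases take $\log_2 p = 2$ more.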

7.5  Multidimensional  Meshes 

In this section we examine multidimensional meshes. A one-dimensional mesh or linear
array of processors is a one-dimensional (1D) array of computing elements connected via
nearest-neighbor connections. (See Fig. 7.7.) If the vertices of the array are indexed with
integers from the set $\{1, 2, 3, \ldots, n\}$, then vertex $i$, $2 \le i \le n - 1$, is connected to vertices
$i - 1$ and $i + 1$. If the linear array is a ring, vertices 1 and $n$ are also connected. Such an
end-to-end connection can be made with short connections by folding the linear array about
its midpoint.

The linear array is an important model that finds application in very large-scale integrated
(VLSI) circuits. When the processors of a linear array operate in synchrony (which is the
usual way in which they are used), it is called a linear systolic array (a systole is a recurrent
rhythmic contraction, especially of the heart muscle). A systolic array is any mesh (typically
1D or 2D) in which the processors operate in synchrony. The computing elements of a systolic
array are called cells. A linear systolic array that convolves two binary sequences is described
in Section 1.6.

A multidimensional mesh (or mesh) (see Fig. 7.8) offers better connectivity between
processors than a linear array. As a consequence, a multidimensional mesh generally can compute
functions more quickly than a 1D one. We illustrate this point by matrix multiplication on
2D meshes in Section 7.5.3.

Figure 7.8 shows 2D and 3D meshes. Each vertex of the 2D mesh is numbered by a pair
$(r, c)$, where $0 \le r \le n - 1$ and $0 \le c \le n - 1$ are its row and column indices. (If the cell

$$\begin{array}{ccc}
A_{3,1} & 0 & 0 \\
0 & A_{3,2} & 0 \\
A_{2,1} & 0 & A_{3,3} \\
0 & A_{2,2} & 0 \\
A_{1,1} & 0 & A_{2,3} \\
0 & A_{1,2} & 0 \\
0 & 0 & A_{1,3}
\end{array}$$

Figure 7.7 A linear array to compute the matrix-vector product $Ax$, where $A = [a_{i,j}]$ and
$x^T = (x_1, \ldots, x_n)$. On each cycle, the $i$th processor sets its current sum, $S_i$, to the sum to its
right, $S_{i+1}$, plus the product of its local value, $x_i$, with its vertical input.




©John  E  Savage 




Figure 7.8 (a) A two-dimensional mesh with optional connections between the boundary
elements shown by dashed lines. (b) A 3D mesh (a cube) in which elements are shown as subcubes.


$(r, c)$ is associated with the integer $rn + c$, this is the row-major order of the cells. Cells are
numbered left-to-right from 0 to 3 in the first row, 4 to 7 in the second, 8 to 11 in the third,
and 12 to 15 in the fourth.) Vertex $(r, c)$ is adjacent to vertices $(r - 1, c)$ and $(r + 1, c)$ for
$1 \le r \le n - 2$. Similarly, vertex $(r, c)$ is adjacent to vertices $(r, c - 1)$ and $(r, c + 1)$ for
$1 \le c \le n - 2$. Vertices on the boundaries may or may not be connected to other boundary
vertices, and may be connected in a variety of ways. For example, vertices in the first row
(column) can be connected to those in the last row (column) in the same column (row) (this is
a toroidal mesh) or the next larger column (row). The second type of connection is associated
with the dashed lines in Fig. 7.8(a).

Each vertex in a 3D mesh is indexed by a triple $(x, y, z)$, $0 \le x, y, z \le n - 1$, as suggested
in Fig. 7.8(b). Connections between boundary vertices, if any, can be made in a variety of
ways. Meshes with larger dimensionality are defined in a similar fashion.

A $d$-dimensional mesh consists of processors indexed by a $d$-tuple $(n_1, n_2, \ldots, n_d)$ in
which $0 \le n_j \le N_j - 1$ for $1 \le j \le d$. If processors $(n_1, n_2, \ldots, n_d)$ and $(m_1, m_2, \ldots, m_d)$
are adjacent, there is some $j$ such that $n_i = m_i$ for $i \ne j$ and $|n_j - m_j| = 1$. There may also
be connections between boundary processors, that is, processors for which one component of
their index has either its minimum or maximum value.
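The adjacency rule translates directly into code. The sketch below assumes a mesh with no optional boundary connections; the names `mesh_neighbors` and `row_major` are hypothetical.

```python
def mesh_neighbors(index, dims):
    """Neighbors of a processor in a d-dimensional mesh that has no
    boundary (wraparound) connections.

    index: tuple (n_1, ..., n_d); dims: tuple (N_1, ..., N_d).
    """
    result = []
    for j, (n_j, N_j) in enumerate(zip(index, dims)):
        for step in (-1, 1):
            if 0 <= n_j + step <= N_j - 1:
                result.append(index[:j] + (n_j + step,) + index[j + 1:])
    return result


def row_major(r, c, n):
    """Integer r*n + c associated with cell (r, c) of an n x n mesh."""
    return r * n + c


print(mesh_neighbors((0, 0), (4, 4)))     # a corner cell of a 4 x 4 mesh
print(row_major(3, 3, 4))                 # last cell in row-major order
```

Interior processors have $2d$ neighbors, while boundary processors have fewer unless boundary connections are added.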

7.5.1 Matrix-Vector Multiplication on a Linear Array

As suggested in Fig. 7.7, the cells in a systolic array can have external as well as nearest-neighbor
connections. This systolic array computes the matrix-vector product $Ax$ of an $n \times n$ matrix
with an $n$-vector. (In the figure, $n = 3$.) The cells of the systolic array beat in a rhythmic
fashion. The $i$th processor sets its current sum, $S_i$, to the product of $x_i$ with its vertical input
plus the value of $S_{i+1}$ to its right (the value 0 is read by the rightmost cell). Initially, $S_i = 0$ for
$1 \le i \le n$. Since alternating vertical inputs are 0, the alternating values of $S_i$ are 0. In Fig. 7.7
the successive values of $S_3$ are $A_{1,3}x_3$, 0, $A_{2,3}x_3$, 0, $A_{3,3}x_3$, 0, 0. The successive values of $S_2$
are 0, $A_{1,2}x_2 + A_{1,3}x_3$, 0, $A_{2,2}x_2 + A_{2,3}x_3$, 0, $A_{3,2}x_2 + A_{3,3}x_3$, 0. The successive values of $S_1$
are 0, 0, $A_{1,1}x_1 + A_{1,2}x_2 + A_{1,3}x_3$, 0, $A_{2,1}x_1 + A_{2,2}x_2 + A_{2,3}x_3$, 0, $A_{3,1}x_1 + A_{3,2}x_2 + A_{3,3}x_3$.
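The rhythm of the array can be reproduced with a short simulation. This is a sketch under assumptions: indices are 0-based, cycles are numbered from 1, and the input schedule (entry $A_{r,j}$ reaches cell $j$ at cycle $2r + n - j$ in 0-based indices) is inferred from the sequences of values listed above. The function name is hypothetical.

```python
def systolic_matvec(A, x):
    """Simulate the linear systolic array for y = A x (0-based indices)."""
    n = len(x)
    S = [0] * (n + 1)          # S[n] is the constant 0 read by the rightmost cell
    y = []
    for t in range(1, 3 * n - 1):            # cycles 1 .. 3n - 2
        new_S = [0] * (n + 1)
        for j in range(n):
            k = t - n + j                    # equals 2r when A[r][j] arrives
            v = A[k // 2][j] if k >= 0 and k % 2 == 0 and k // 2 < n else 0
            # New sum: local product plus the right neighbor's previous sum.
            new_S[j] = x[j] * v + S[j + 1]
        S = new_S
        k = t - n                            # leftmost cell holds y[r] when k = 2r
        if k >= 0 and k % 2 == 0:
            y.append(S[0])
    return y


print(systolic_matvec([[1, 2, 3], [4, 5, 6], [7, 8, 9]], [1, 2, 3]))
```

In this numbering the components of $Ax$ appear in the leftmost cell on alternate cycles, with zeros in between, as in the sequence of values of $S_1$ above.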

The algorithm described above to compute the matrix-vector product for a $3 \times 3$ matrix
clearly extends to arbitrary $n \times n$ matrices. (See Problem 7.8.) Since the last element of an
$n \times n$ matrix arrives at the array after $3n - 2$ time steps, such an array will complete its task in
$3n - 1$ time steps. A lower bound on the time for this problem (see Problem 7.9) can be derived
by showing that the $n^2$ entries of the matrix $A$ and the $n$ entries of the vector $x$ must be read
to compute $Ax$ correctly by an algorithm, whether serial or not. By Theorem 7.4.1 it follows
that all systolic algorithms using $n$ processors require at least $n$ steps. Thus, the above algorithm
is optimal to within a constant multiplicative factor.

THEOREM 7.5.1 There exists a linear systolic array with $n$ cells that computes the product of an
$n \times n$ matrix with an $n$-vector in $3n - 1$ steps, and no algorithm on such an array can do this
computation in fewer than $n$ steps.

Since the product of two $n \times n$ matrices can be realized as $n$ matrix-vector products with
an $n \times n$ matrix, an $n$-processor systolic array exists that can multiply two matrices nearly
optimally.

7.5.2  Sorting  on  Linear  Arrays 

A second application of linear systolic arrays is bubble sorting of integers. A sequential version
of the bubble sort algorithm passes over the entries in a tuple $(x_1, x_2, \ldots, x_n)$ from left to
right multiple times. On the first pass it finds the largest element and moves it to the rightmost
position. It applies the same procedure to the first $n - 1$ elements of the resultant list, stopping
when it finds a list containing one element. This sequential procedure takes time proportional
to $n + (n - 1) + (n - 2) + \cdots + 2 + 1 = n(n + 1)/2$.

A parallel version of bubble sort, sometimes called odd-even transposition sort, is naturally
realized on a linear systolic array. The $n$ entries of the array are placed in $n$ cells. Let $c_i$
be the word in the $i$th cell. We assume that in one unit of time two adjacent cells can read
words stored in each other’s memories ($c_i$ and $c_{i+1}$), compare them, and swap them if one ($c_i$)
is larger than the other ($c_{i+1}$). The odd-even transposition sort algorithm executes $n$ stages.
In the even-numbered stages, integers in even-numbered cells are compared with integers in
the next higher numbered cells and swapped, if larger. In the odd-numbered stages, the same
operation is performed on integers in odd-numbered cells. (See Fig. 7.9.) We show that in $n$
steps the sorting is complete.
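A serial simulation makes the stage structure explicit. In this sketch the cells are numbered from 1 as in the text, so list index $i$ holds cell $i + 1$; the function name is an illustrative choice.

```python
def odd_even_transposition_sort(items):
    """Sort a list with n stages of neighbor compare-and-swap."""
    cells = list(items)
    n = len(cells)
    for stage in range(1, n + 1):
        # Odd stages start at cell 1 (index 0), even stages at cell 2.
        first = 0 if stage % 2 == 1 else 1
        # The pairs compared within one stage are disjoint, so a systolic
        # array can perform all of them in a single time step.
        for i in range(first, n - 1, 2):
            if cells[i] > cells[i + 1]:
                cells[i], cells[i + 1] = cells[i + 1], cells[i]
    return cells


print(odd_even_transposition_sort([5, 4, 3, 2, 1]))
```

The inner loop is serial here only because the simulation runs on one processor; on the array each stage costs one step, for $n$ steps in total.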

THEOREM 7.5.2 Bubble sort of $n$ elements on a linear systolic array can be done in at most $n$ steps.
Every algorithm to sort a list of $n$ elements on a linear systolic array requires at least $n - 1$ steps.
Thus, bubble sort on a linear systolic array is almost optimal.

Proof To derive the upper bound we use the zero-one principle (see Theorem 6.8.1), which
states that if a comparator network for inputs over an ordered set $A$ correctly sorts all binary
inputs, it correctly sorts all inputs. The bubble sort systolic array maps directly to a
comparator network because each of its operations is data-independent, that is, oblivious. To
see that the systolic array correctly sorts binary sequences, consider the position, $r$, of the
rightmost 1 in the array.

Figure 7.9 A systolic implementation of bubble sort on a sequence of five items. Underlined
pairs of items are compared and swapped if out of order. The bottom row shows the first set of
comparisons.


If $r$ is even, on the first phase of the algorithm this 1 does not move. However, on all
subsequent phases it moves right until it arrives at its final position. If $r$ is odd, it moves
right on all phases until it arrives in its final position. Thus by the second step the rightmost
1 moves right on every step until it arrives at its final position. The second rightmost 1 is
free to move to the right without being blocked by the first 1 after the second phase. This
second 1 will move to the right by the third phase and continue to do so until it arrives at
its final position. In general, the $k$th rightmost 1 starts moving to the right by the $(k + 1)$st
phase and continues until it arrives at its final position. It follows that at most $n$ phases are
needed to sort the 0-1 sequence. By the zero-one principle, the same applies to all sequences.

To derive the lower bound, assume that the sorted elements are increasing from left to
right in the linear array. Let the elements initially be placed in decreasing order from left
to right. Thus, the process of sorting moves the largest element from the leftmost location
in the array to the rightmost. This requires at least $n - 1$ steps. The same lower bound
holds if some other permutation of the $n$ elements is desired. For example, if the $k$th largest
element resides in the rightmost cell at the end of the computation, it can reside initially in
the leftmost cell, requiring at least $n - 1$ operations to move to its final position. ■


7.5.3  Matrix  Multiplication  on  a  2D  Mesh 

2D systolic arrays are natural structures on which to compute the product $C = A \times B$ of
matrices $A$ and $B$. (Matrix multiplication is discussed in Section 6.3.) Since $C = A \times B$ can
be realized as $n$ matrix-vector multiplications, $C$ can be computed with $n$ linear arrays. (See
Fig. 7.7.) If the columns of $B$ are stored in successive arrays and the entries of $A$ pass from
one array to the next in one unit of time, the $n$th array receives the last entry of $B$ after $4n - 2$
time steps. Thus, this 2D systolic array computes $C = A \times B$ in $4n - 1$ steps. Somewhat
more efficient 2D systolic arrays can be designed. We describe one of them below.

Figure 7.10 shows a 2D mesh for matrix multiplication. Each cell of this mesh adds to
its stored value the product of the values arriving from above and from its left. These two values
pass through the cells to those below and to their right, respectively. When the entries of $A$ are
supplied on the left and those of $B$ are supplied from above in the order shown, the cell $C_{i,j}$
computes $c_{i,j}$, the $(i, j)$ entry of the product matrix $C$. For example, cell $C_{2,3}$ accumulates the
value $c_{2,3} = a_{2,1} b_{1,3} + a_{2,2} b_{2,3} + a_{2,3} b_{3,3}$. After the entries of $C$ have been computed,
they are produced as outputs by shifting the entries of the mesh to one side of the array. When
generalized to $n \times n$ matrices, this systolic array requires $2n - 1$ steps for the last of the matrix
components to enter the array, and another $n - 1$ steps to compute the last entry $c_{n,n}$. An
additional $n$ steps are needed to shift the components of the product matrix out of the array.
Thus, this systolic array performs matrix multiplication in $4n - 2$ steps.
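The accumulation phase of such a mesh can be simulated as follows. This is a sketch, not a transcription of Fig. 7.10: it assumes 0-based indices and a skewed input schedule in which row $i$ of $A$ enters from the left delayed by $i$ steps and column $j$ of $B$ enters from above delayed by $j$ steps. The output-shifting phase is omitted.

```python
def systolic_matmul(A, B):
    """Simulate the accumulation phase of an n x n systolic mesh."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    a = [[0] * n for _ in range(n)]    # values travelling to the right
    b = [[0] * n for _ in range(n)]    # values travelling downward
    for t in range(3 * n - 2):         # (2n - 1) + (n - 1) steps
        new_a = [[0] * n for _ in range(n)]
        new_b = [[0] * n for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if j == 0:             # left boundary: next entry of row i of A
                    new_a[i][j] = A[i][t - i] if 0 <= t - i < n else 0
                else:
                    new_a[i][j] = a[i][j - 1]
                if i == 0:             # top boundary: next entry of column j of B
                    new_b[i][j] = B[t - j][j] if 0 <= t - j < n else 0
                else:
                    new_b[i][j] = b[i - 1][j]
        a, b = new_a, new_b
        for i in range(n):
            for j in range(n):
                C[i][j] += a[i][j] * b[i][j]
    return C


print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

Under this schedule the entries $a_{i,k}$ and $b_{k,j}$ meet in cell $(i, j)$ at step $i + j + k$, so $3n - 2$ steps suffice for accumulation, matching the count of $2n - 1$ plus $n - 1$ steps given above.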

We put the following requirements on every systolic array (of any dimension) that
computes the matrix multiplication function: a) each component of each matrix enters the array
at one location, and b) each component of the product matrix is computed at a unique cell.
We now show that the systolic matrix multiplication algorithm is optimal to within a constant
multiplicative factor.

THEOREM 7.5.3 Two $n \times n$ matrices can be multiplied by an $n \times n$ systolic array in $4n - 2$ steps
and every two-dimensional systolic array for this problem requires at least $(n/2) - 1$ steps.

Proof The proof that two $n \times n$ matrices can be multiplied in $4n - 2$ steps by a
two-dimensional systolic array was given above. We now show that $\Omega(n)$ steps are required to
multiply two $n \times n$ matrices, $A$ and $B$, to produce the matrix $C = A \times B$. Observe that
the number of cells in a two-dimensional array that are within $d$ moves from any particular
cell is at most $\sigma(d)$, where $\sigma(d) = 2d^2 + 2d + 1$. The maximum occurs at the center of the
array. (See Problem 7.11.)

Figure 7.10 A two-dimensional mesh for the multiplication of two matrices. The entries in
these matrices are supplied in successive time intervals to processors on the boundary of the mesh.

Given a systolic array with inputs supplied externally over time (see Fig. 7.10), we enlarge
the array so that each component of each matrix is initially placed in a unique cell. The
enlarged array contains the original $n \times n$ array.

Let $C = [c_{i,j}]$. Because $c_{i,j} = \sum_u a_{i,u} b_{u,j}$, it follows that for each value of $i$, $j$, $t$, and
$u$ there is a path from $a_{i,u}$ to the cell at which $c_{i,j}$ is computed as well as a path from $b_{t,j}$ to
this same cell. Thus, it follows that there is a path in the array between arbitrary entries $a_{i,u}$
and $b_{t,j}$ of the matrices $A = [a_{i,u}]$ and $B = [b_{t,j}]$. Let $s$ be the maximum number of array
edges between an element of $A$ or $B$ and an element of $C$ on which it depends. It follows
that at least $s$ steps are needed to form $C$ and that every element of $A$ and $B$ is within
distance $2s$ of every other. Furthermore, each of the $2n^2$ elements of $A$ and $B$ is located initially
in a unique cell of the expanded systolic array. Since there are at most $\sigma(2s)$ vertices within a
distance of $2s$, it follows that $\sigma(2s) = 2(2s)^2 + 2(2s) + 1 \ge 2n^2$, from which we conclude that the
number of steps to multiply $n \times n$ matrices is at least
$s \ge \frac{1}{2}(n^2 - \frac{1}{4})^{1/2} - \frac{1}{4} \ge \frac{n}{2} - 1$. ■
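Both the counting function $\sigma(d)$ and the resulting bound on $s$ can be checked by brute force for small cases. The sketch below assumes an array large enough that the boundary does not interfere; the function names are hypothetical.

```python
def cells_within(d):
    """Count cells reachable in at most d unit moves (up, down, left,
    right) from a fixed cell of an unbounded two-dimensional array."""
    return sum(1 for x in range(-d, d + 1)
                 for y in range(-d, d + 1) if abs(x) + abs(y) <= d)


def sigma(d):
    return 2 * d * d + 2 * d + 1


def min_steps(n):
    """Smallest integer s satisfying sigma(2s) >= 2 n^2."""
    s = 0
    while sigma(2 * s) < 2 * n * n:
        s += 1
    return s


print([cells_within(d) for d in range(4)], [min_steps(n) for n in (4, 8, 16)])
```

The first list agrees with $2d^2 + 2d + 1$, and each value of `min_steps(n)` is at least $n/2 - 1$, as the proof requires.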

7.5.4 Embedding of 1D Arrays in 2D Meshes

Given an algorithm for a linear array, we ask whether that algorithm can be efficiently realized
on a 2D mesh. This is easily determined: we need only specify a mapping of the cells of a linear
array to cells in the 2D mesh. Assuming that the two arrays have the same number of cells, a
natural mapping is obtained by giving the cells of an $n \times n$ mesh the snake-row ordering. (See
Fig. 7.11.) In this ordering cells of the first row are ordered from left to right and numbered
from 0 to $n - 1$; those in the second row are ordered from right to left and numbered from
$n$ to $2n - 1$. This process repeats, alternating between ordering cells from left to right and
right to left and numbering the cells in succession. Ordering the cells of a linear array from
left to right and numbering them from 0 to $n^2 - 1$ allows us to map the linear array directly
to the 2D mesh. Any algorithm for the linear array runs in the same time on a 2D mesh if the
processors in the two cases are identical.
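The snake-row ordering has a simple closed form. In the sketch below rows and columns are numbered from 0, and the names `snake_index` and `snake_cell` are illustrative.

```python
def snake_index(r, c, n):
    """Position of mesh cell (r, c) in the snake-row ordering."""
    return r * n + (c if r % 2 == 0 else n - 1 - c)


def snake_cell(k, n):
    """Inverse map: the mesh cell that holds linear-array cell k."""
    r, offset = divmod(k, n)
    return (r, offset if r % 2 == 0 else n - 1 - offset)


n = 4
for r in range(n):
    print([snake_index(r, c, n) for c in range(n)])
```

Consecutive cells of the linear array always land on adjacent mesh cells, which is why the linear-array algorithm runs without slowdown. In the other direction, two vertically adjacent mesh cells can be as far as $2n - 1$ positions apart in the ordering.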

Now we ask if, given an algorithm for a 2D mesh, we can execute it on a linear array. The
answer is affirmative, although the execution time of the algorithm may be much greater on the
1D array than on the 2D mesh. As a first step, we map vertices of the 2D mesh onto vertices
of the 1D array. The snake-row ordering of the cells of an $n \times n$ array provides a convenient
mapping of the cells of the 2D mesh onto the cells of the linear array with $n^2$ cells. We assume
that each of the cells of the linear array is identical to a cell in the 2D mesh.

Figure 7.11 Snake-row ordering of the vertices of a two-dimensional mesh.

We now address the question of communication between cells. When mapped to the 1D
array, cells can communicate only with their two immediate neighbors in the array. However,
cells on the $n \times n$ mesh can communicate with as many as four neighbors. Unfortunately, cells
in one row of the 2D mesh that are neighbors of cells in an adjacent row are mapped to cells
that are as far as $2n - 1$ cells away in the linear array. We show that with a factor of $8n - 2$
slowdown, the linear array can simulate the 2D mesh. A slowdown by at least a factor of $n/2$
is necessary for those problems and data for which a datum moves from the first to the last
entry in the linear array (in $n^2 - 1$ steps) to simulate a movement that takes $2n - 1$ steps on the
2D mesh. ($(n^2 - 1)/(2n - 1) \ge n/2$ for $n \ge 2$.)

Given  an  algorithm  for  a  2D  mesh,  slow  it  down  as  follows: 

a)  Subdivide  each  cycle  into  six  subcycles. 

b)  In  the  first  of  these  subcycles  let  each  cell  compute  using  its  local  data. 

c)  In  the  second  subcycle  let  each  cell  communicate  with  neighbor(s)  in  adjacent  columns. 

d)  In  the  third  subcycle  let  cells  in  even-numbered  rows  send  messages  to  cells  in  the  next 
higher  numbered  rows. 

e)  In  the  fourth  subcycle  let  cells  in  even-numbered  rows  receive  messages  from  cells  in  the 
next  higher  numbered  rows. 

f)  In  the  fifth  subcycle  let  cells  in  odd-numbered  rows  send  messages  to  cells  in  next  higher 
numbered  rows. 

g)  In  the  sixth  subcycle  let  cells  in  odd-numbered  rows  receive  messages  from  cells  in  next 
higher  numbered  rows. 

When the revised 2D algorithm is executed on the linear array, computation occurs in the
first subcycle in unit time. During the second subcycle communication occurs in unit time
because cells that are column neighbors in the 2D mesh are adjacent in the 1D array. The
remaining four subcycles involve communication between pairs of groups of $n$ cells each. This
can be done for all pairs in $2n - 1$ time steps: each cell shifts a datum in the direction of the
cell for which it is destined. After $2n - 1$ steps it arrives and can be processed. Each cycle of
the 2D algorithm thus takes at most $2 + 4(2n - 1) = 8n - 2$ steps on the linear array. We
summarize this result below.

THEOREM 7.5.4 Any $T$-step systolic algorithm on an $n \times n$ array can be simulated on a linear
systolic array with $n^2$ cells in at most $(8n - 2)T$ steps.

In  the  next  section  we  demonstrate  that  hypercubes  can  be  embedded  into  meshes.  From 
this  result  we  derive  mesh-based  algorithms  for  a  variety  of  problems  from  hypercube-based 
algorithms  for  these  problems. 

7.6  Hypercube-Based  Machines 

A $d$-dimensional hypercube has $2^d$ vertices. When they are indexed by binary $d$-tuples
$(a_{d-1}, a_{d-2}, \ldots, a_0)$, adjacent vertices are those whose tuples differ in one position. Thus, the
2D hypercube is a square, the 3D hypercube is the traditional 3-cube, and the four-dimensional
hypercube consists of two 3-cubes with edges between corresponding pairs of vertices. (See
Fig. 7.12.) The $d$-dimensional hypercube is composed of two $(d - 1)$-dimensional hypercubes
in which each vertex in one hypercube has an edge to the corresponding vertex in the other.
The degree of each vertex in a $d$-dimensional hypercube is $d$ and its diameter is $d$ as well.

Figure 7.12 Hypercubes in two, three, and four dimensions.
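When a vertex is stored as an integer whose binary digits are its $d$-tuple, adjacency amounts to flipping one bit. The following sketch uses that encoding; the function names are assumptions made for illustration.

```python
def hypercube_neighbors(v, d):
    """Vertices adjacent to v, an integer with bits (a_{d-1}, ..., a_0)."""
    return [v ^ (1 << j) for j in range(d)]


def hypercube_distance(u, v):
    """Shortest-path length: the number of positions where u and v differ."""
    return bin(u ^ v).count("1")


d = 4
print(sorted(hypercube_neighbors(0b0101, d)))
print(max(hypercube_distance(0, v) for v in range(2 ** d)))
```

Each vertex has exactly $d$ neighbors, and the largest distance, attained by complementary addresses, is $d$.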

While the hypercube is a very useful model for algorithm development, the construction
of hypercube-based networks can be costly due to the high degree of the vertices. For example,
each vertex in a hypercube with 4,096 vertices has degree 12; that is, each vertex is connected to
12 other vertices, and a total of 49,152 connections are necessary among the 4,096 processors.
By contrast, a $2^6 \times 2^6$ 2D mesh has the same number of processors but at most 16,384 wires.
The ratio between the number of wires in a $d$-dimensional hypercube and a square mesh with
the same number of vertices is $d/4$. This makes it considerably more difficult to realize a
hypercube of high dimensionality than a 2D mesh with a comparable number of vertices.

7.6.1 Embedding Arrays in Hypercubes

Given an algorithm designed for an array, we ask whether it can be efficiently realized on
a hypercube network. The answer is positive. We show by induction that if $d$ is even, a
$2^{d/2} \times 2^{d/2}$ array can be embedded into a $d$-dimensional, $2^d$-vertex hypercube and if $d$ is odd,
a $2^{(d+1)/2} \times 2^{(d-1)/2}$ array can be embedded into a $d$-dimensional hypercube. The base cases
are $d = 2$ and $d = 3$.

Figure 7.13 Mappings of $2 \times 2$, $4 \times 2$, and $4 \times 4$ arrays to two-, three-, and four-dimensional
hypercubes. The binary tuples identify vertices of a hypercube.

When $d = 2$, a $2^{d/2} \times 2^{d/2}$ array is a $2 \times 2$ array that is itself a four-vertex hypercube.
When $d = 3$, a $2^{(d+1)/2} \times 2^{(d-1)/2}$ array is a $4 \times 2$ array. (See Fig. 7.13.) It
can be embedded into a three-dimensional hypercube by mapping the top and bottom $2 \times 2$
subarrays to the vertices of the two 2-cubes contained in the 3-cube. The edges between the
two subarrays correspond directly to edges between vertices of the 2-cubes.

Applying the same kind of reasoning to the inductive hypothesis, we see that the hypothesis
holds for all values of $d \ge 2$. If a 2D array is not of the form indicated, it can be embedded
into such an array whose sides are a power of 2 by at most quadrupling the number of vertices.
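One explicit embedding with this property can be written using binary reflected Gray codes. This is a different construction from the inductive argument in the text, and it is not claimed to reproduce the labeling of Fig. 7.13. The function names are hypothetical.

```python
def gray(i):
    """The i-th word of the binary reflected Gray code."""
    return i ^ (i >> 1)


def embed(r, c, col_bits):
    """Hypercube vertex for cell (r, c) of a 2^a x 2^b array, where
    col_bits = b and the row code occupies the high-order bits."""
    return (gray(r) << col_bits) | gray(c)


# A 4 x 2 array mapped into the three-dimensional hypercube.
for r in range(4):
    print([format(embed(r, c, 1), "03b") for c in range(2)])
```

Successive Gray code words differ in one bit, so cells that are adjacent in a row or column of the array map to adjacent hypercube vertices.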

7.6.2  Cube-Connected  Cycles 

A reasonable alternative to the hypercube is the cube-connected cycles (CCC) network shown
in Fig. 7.14. Each of its vertices has degree 3, yet the graph has a diameter only a constant factor
larger than that of the hypercube. The $(d, r)$-CCC is defined in terms of a $d$-dimensional
hypercube when $r \ge d$. Let $(a_{d-1}, a_{d-2}, \ldots, a_0)$ and $(b_{d-1}, b_{d-2}, \ldots, b_0)$ be the indices of two
adjacent vertices on the $d$-dimensional hypercube. Assume that these tuples differ in the $j$th
component, $0 \le j \le d - 1$; that is, $a_j = b_j \oplus 1$ and $a_i = b_i$ for $i \ne j$. Associated with vertex
$(a_{d-1}, \ldots, a_p, \ldots, a_0)$ of the hypercube are the vertices $(p, a_{d-1}, \ldots, a_p, \ldots, a_0)$,
$0 \le p \le r - 1$, of the CCC that form a ring; that is, vertex $(p, a_{d-1}, \ldots, a_p, \ldots, a_0)$ is adjacent to
vertices $((p + 1) \bmod r, a_{d-1}, \ldots, a_p, \ldots, a_0)$ and $((p - 1) \bmod r, a_{d-1}, \ldots, a_p, \ldots, a_0)$.
In addition, for $0 \le p \le d - 1$, vertex $(p, a_{d-1}, \ldots, a_p, \ldots, a_0)$ is adjacent to vertex
$(p, a_{d-1}, \ldots, a_p \oplus 1, \ldots, a_0)$ on the ring associated with vertex $(a_{d-1}, \ldots, a_p \oplus 1, \ldots, a_0)$
of the hypercube.


Figure 7.14 The cube-connected cycles network replaces each vertex of a $d$-dimensional hypercube
with a ring of $r \ge d$ vertices in which each vertex is connected to its neighbor on the ring.
The $j$th ring vertex, $0 \le j \le d - 1$, is also connected to the $j$th ring vertex at an adjacent corner
of the original hypercube.

The diameter of the CCC is at most $3r/2 + d$, as we now show. Given two vertices
$v_1 = (p, a_{d-1}, \ldots, a_0)$ and $v_2 = (q, b_{d-1}, \ldots, b_0)$, let their hypercube addresses
$a = (a_{d-1}, \ldots, a_0)$ and $b = (b_{d-1}, \ldots, b_0)$ differ in $k$ positions. To move from $v_1$ to $v_2$, move
along the ring containing $v_1$ by decreasing processor numbers until reaching the next lower
index at which $a$ and $b$ differ. (Wrap around to the highest index, if necessary.) Move from
this ring to the ring whose hypercube address differs in this index. Move around this ring until
arriving at the next lower indexed processor at which $a$ and $b$ differ. Continue in this fashion
until reaching the ring with hypercube address $b$. The number of edges traversed in this phase
of the movement is at most one for each vertex on the ring plus at most one for each of the
$k \le d$ positions on which the addresses differ. Finally, move around the last ring toward the
vertex $v_2$ along the shorter path. This requires at most $r/2$ edge traversals. Thus, the maximal
distance between two vertices, the diameter of the graph, is at most $3r/2 + d$.
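
The bound can be checked numerically on small instances. The following is a minimal sketch, not part of the original development: it builds the CCC from the adjacency rules stated above and measures the diameter by breadth-first search. The function names `ccc_graph` and `diameter` and the encoding of a vertex as a pair (ring position, hypercube address as an integer) are assumptions made for illustration.

```python
from collections import deque

def ccc_graph(d, r):
    """Adjacency sets of a CCC; vertex (p, a) is ring position p on the ring at hypercube address a."""
    assert r >= d
    adj = {}
    for a in range(1 << d):
        for p in range(r):
            nbrs = {((p + 1) % r, a), ((p - 1) % r, a)}  # ring edges
            if p < d:
                nbrs.add((p, a ^ (1 << p)))              # hypercube edge across bit p
            nbrs.discard((p, a))                         # no self-loops on very small rings
            adj[(p, a)] = nbrs
    return adj

def diameter(adj):
    """Largest shortest-path distance, found by a breadth-first search from every vertex."""
    best = 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        best = max(best, max(dist.values()))
    return best
```

For example, `diameter(ccc_graph(3, 4))` can be compared against the bound $3 \cdot 4/2 + 3 = 9$.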


7.7  Normal  Algorithms 

Normal  algorithms  on  hypercubes  are  systolic  algorithms  with  the  property  that  in  each  cycle 
some  bit  position  in  an  address  is  chosen  and  data  is  exchanged  only  between  vertices  whose 
addresses  differ  in  this  position.  An  operation  is  then  performed  on  this  data  in  one  or  both 
vertices.  Thus,  if  the  hypercube  has  three  dimensions  and  the  chosen  dimension  is  the  second, 
the  following  pairs  of  vertices  exchange  data  and  perform  operations  on  them:  (0,  0,  0)  and 
(0, 1, 0), (0, 0, 1) and (0, 1, 1), (1, 0, 0) and (1, 1, 0), and (1, 0, 1) and (1, 1, 1). A fully normal
algorithm is a normal algorithm that visits each of the dimensions of the hypercube in
sequence. There are two kinds of fully normal algorithms, ascending and descending algorithms;
ascending algorithms visit the dimensions of the hypercube in ascending order, whereas
descending algorithms visit them in descending order. We show that many important algorithms
are fully normal algorithms or combinations of ascending and descending algorithms.
These  algorithms  can  be  efficiently  translated  into  mesh-based  algorithms,  as  we  shall  see. 
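
The pairing rule for one cycle of a normal algorithm can be sketched in a few lines. The helper names `exchange_pairs` and `as_tuple` are illustrative assumptions; addresses are held as integers with bit 0 the least significant, so the "second dimension" of the example above is bit 1.

```python
def exchange_pairs(d, j):
    """Pairs of d-bit addresses that differ only in bit j (bit 0 is least significant)."""
    bit = 1 << j
    return [(a, a | bit) for a in range(1 << d) if not a & bit]

def as_tuple(a, d):
    """Address a written as the tuple (a_{d-1}, ..., a_0)."""
    return tuple((a >> i) & 1 for i in reversed(range(d)))

# The three-dimensional example from the text: exchanges across the second dimension.
for u, v in exchange_pairs(3, 1):
    print(as_tuple(u, 3), "and", as_tuple(v, 3))
```

An ascending algorithm would call `exchange_pairs(d, j)` for j = 0, 1, ..., d - 1 in turn, a descending one in the opposite order.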

The fast Fourier transform (FFT) (see Section 6.7.3) is an ascending algorithm. As suggested
in the butterfly graph of Fig. 7.15, if each vertex at each level in the FFT graph on
$n = 2^d$ inputs is indexed by a pair $(l, a)$, where $a$ is a binary $d$-tuple and $0 \le l \le d$, then
at level $l$ pairs of vertices are combined whose indices differ in their $l$th component. (See
Problem 7.14.) It follows that the FFT graph can be computed in levels on the $d$-dimensional
hypercube by retaining the values corresponding to the column indexed by $a$ in the hypercube
vertex whose index is $a$. Consequently, the FFT graph has exactly the minimal connectivity
required to execute an ascending fully normal algorithm. If the directions of all edges
are reversed, the graph is exactly that needed for a descending fully normal algorithm. (The
convolution function $f^{(n,m)}_{\mathrm{conv}} : \mathcal{R}^{n+m} \mapsto \mathcal{R}^{n+m-1}$ over a commutative ring $\mathcal{R}$ can also be
implemented as a normal algorithm in time $O(\log n)$ on an $n$-vertex hypercube, $n = 2^d$. See
Problem 7.15.)

Similarly, because the graph of Batcher’s bitonic merging algorithm (see Section 6.8.1) is
the butterfly graph associated with the FFT, it too is a normal algorithm. Thus, two sorted lists
of length $n = 2^d$ can be merged in $d = \log_2 n$ steps. As stated below, because the butterfly
graph on $2^d$ inputs contains butterfly subgraphs on $2^k$ inputs, $k < d$, a recursive normal
sorting algorithm can be constructed that sorts on the hypercube in $O(\log^2 n)$ steps. The
reader is asked to prove the following theorem. (See Problem 6.29.)

Figure 7.15 The FFT butterfly graph with column numberings. The predecessors of vertices
at the $k$th level differ in their $k$th least significant bits.


THEOREM 7.7.1 There exists a normal sorting algorithm on the $p$-vertex hypercube, $p = 2^d$, that
sorts $p$ items in time $O(\log^2 p)$.
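
One way to see why such an algorithm is normal is to simulate a standard formulation of bitonic sort in which position i of a list plays the role of the hypercube vertex with address i. The sketch below is an illustration under that assumption and is not necessarily the construction intended in Problem 6.29; the name `bitonic_sort` and the returned step count are introduced here. In every step the compared positions differ in exactly one address bit.

```python
def bitonic_sort(values):
    """Sort a list of length 2^d; returns (sorted list, number of compare-exchange steps)."""
    a = list(values)
    n = len(a)
    steps = 0
    k = 2
    while k <= n:                 # merge bitonic runs of length k
        j = k // 2
        while j >= 1:             # one hypercube dimension per step, in descending order
            for i in range(n):
                partner = i ^ j   # address differing from i in a single bit
                if partner > i:
                    ascending = (i & k) == 0
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            steps += 1
            j //= 2
        k *= 2
    return a, steps
```

For $n = 2^d$ items the loop structure gives $1 + 2 + \cdots + d = d(d+1)/2$ steps, in line with the $O(\log^2 p)$ bound of the theorem.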

Normal  algorithms  can  also  be  used  to  perform  a  sum  on  the  hypercube  and  broadcast 
on  the  hypercube,  as  we  show.  We  give  an  ascending  algorithm  for  the  first  problem  and  a 
descending  algorithm  for  the  second. 

7.7.1  Summing  on  the  Hypercube 

Let the hypercube be $d$-dimensional and let $a = (a_{d-1}, a_{d-2}, \ldots, a_0)$ denote an address of a
vertex. Associate with $a$ the integer $|a| = a_{d-1}2^{d-1} + a_{d-2}2^{d-2} + \cdots + a_0$. Thus, when
$d = 3$, the addresses {0, 1, 2, ..., 7} are associated with the eight 3-tuples {(0, 0, 0), (0, 0, 1),
(0, 1, 0), ..., (1, 1, 1)}, respectively.

Let $V(|a|)$ denote the value stored at the vertex with address $a$. For each $(d - 1)$-tuple
$(a_{d-1}, \ldots, a_1)$, send to vertex $(a_{d-1}, \ldots, a_1, 0)$ the value stored at vertex $(a_{d-1}, \ldots, a_1, 1)$.
In the summing problem we store at vertex $(a_{d-1}, \ldots, a_1, 0)$ the sum of the original values
stored at vertices $(a_{d-1}, \ldots, a_1, 0)$ and $(a_{d-1}, \ldots, a_1, 1)$. Below we show the transmission
(e.g. $V(0) \leftarrow V(1)$) and addition (e.g. $V(0) \leftarrow V(0) + V(1)$) that result for $d = 3$:

$V(0) \leftarrow V(1)$,  $V(0) \leftarrow V(0) + V(1)$
$V(2) \leftarrow V(3)$,  $V(2) \leftarrow V(2) + V(3)$
$V(4) \leftarrow V(5)$,  $V(4) \leftarrow V(4) + V(5)$
$V(6) \leftarrow V(7)$,  $V(6) \leftarrow V(6) + V(7)$

For each $(d - 2)$-tuple $(a_{d-1}, \ldots, a_2)$ we then send to vertex $(a_{d-1}, \ldots, a_2, 0, 0)$ the value
stored at vertex $(a_{d-1}, \ldots, a_2, 1, 0)$. Again for $d = 3$, we have the following data transfers and
additions:

$V(0) \leftarrow V(2)$,  $V(0) \leftarrow V(0) + V(2)$
$V(4) \leftarrow V(6)$,  $V(4) \leftarrow V(4) + V(6)$

We continue in this fashion until reaching the highest dimension of the hypercube, at which point
we have the following actions when $d = 3$:

$V(0) \leftarrow V(4)$,  $V(0) \leftarrow V(0) + V(4)$

At the end of this computation, $V(0)$ is the sum of the values stored in all vertices. This
algorithm for computing $V(0)$ can be extended to any associative binary operator.
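
The stages above translate directly into a loop over dimensions. The following sketch simulates them sequentially on a list indexed by $|a|$; the function name `hypercube_reduce` and the choice to pass the operator as a Python callable are assumptions for illustration. Within one stage all transfers are independent, so on a real hypercube each pass of the inner loop would be a single parallel step.

```python
def hypercube_reduce(values, op):
    """Ascending combine on a list of length 2^d; the result ends up at address 0."""
    v = list(values)
    n = len(v)
    d = n.bit_length() - 1
    assert n == 1 << d, "length must be a power of two"
    for j in range(d):                        # dimensions in ascending order
        bit = 1 << j
        for a in range(0, n, 2 * bit):        # addresses whose bits 0..j are all zero
            v[a] = op(v[a], v[a | bit])       # V(a) <- V(a) op V(a with bit j set)
    return v[0]
```

Because operands are always combined in address order, the operator needs to be associative but not commutative; string concatenation is a convenient way to see this.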

7.7.2  Broadcasting  on  the  Hypercube 

The broadcast operation is obtained by reversing the directions of each of the transmissions
described above. Thus, in the example, $V(0)$ is sent to $V(4)$ in the first stage, in the second
stage $V(0)$ and $V(4)$ are sent to $V(2)$ and $V(6)$, respectively, and in the last stage, $V(0)$,
$V(2)$, $V(4)$, and $V(6)$ are sent to $V(1)$, $V(3)$, $V(5)$, and $V(7)$, respectively.

The algorithm given above to broadcast from one vertex to all others in a hypercube can be
modified to broadcast to just the vertices in a subhypercube that is defined by those addresses
$a = (a_{d-1}, a_{d-2}, \ldots, a_0)$ in which all bits are fixed except for those in some $k$ positions.
For example, {(0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 0)} are the vertices of a subhypercube of the
three-dimensional hypercube (the rightmost bit is fixed). To broadcast to each of these vertices
from (0, 1, 0), say, on the first step send the message to its pair along the second dimension,
namely, (0, 0, 0). On the second step, let these pairs send messages to their pairs along the
third dimension, namely, (0, 1, 0) $\rightarrow$ (1, 1, 0) and (0, 0, 0) $\rightarrow$ (1, 0, 0). This algorithm can be
generalized to broadcast from any vertex in a hypercube to all other vertices in a subhypercube.
Values at all vertices of a subhypercube can be associatively combined in a similar fashion.
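
The subhypercube broadcast can be sketched as a schedule of sends, one list per step. The name `subcube_broadcast` and the representation of the free positions as a list of bit indices (bit 0 least significant) are assumptions made for this illustration.

```python
def subcube_broadcast(source, free_bits):
    """Return, for each step, the (sender, receiver) pairs of a broadcast from `source`
    to the subhypercube obtained by varying only the address bits in `free_bits`."""
    holders = [source]
    schedule = []
    for j in free_bits:                       # one dimension per step
        sends = [(a, a ^ (1 << j)) for a in holders]
        schedule.append(sends)
        holders += [dst for _, dst in sends]  # receivers forward in later steps
    return schedule

# The example from the text: broadcast from (0, 1, 0) with the rightmost bit fixed.
print(subcube_broadcast(0b010, [1, 2]))
```

Passing `free_bits=[2, 1, 0]` with source 0 gives the full descending broadcast described at the start of this section, in $d = 3$ steps.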

The  performance  of  these  normal  algorithms  is  summarized  below. 

THEOREM 7.7.2 Broadcasting from one vertex in a $d$-dimensional hypercube to all other vertices
can be done with a normal algorithm in $O(d)$ steps. Similarly, the associative combination of the
values stored at the vertices of a $d$-dimensional hypercube can be done with a normal algorithm
in $O(d)$ steps. Broadcasting and associative combining can also be done on the vertices of a
$k$-dimensional subcube of the $d$-dimensional hypercube in $O(k)$ steps with a normal algorithm.


7.7.3  Shifting  on  the  Hypercube 

Cyclic shifting can also be done on a hypercube as a normal algorithm. For $n = 2^d$, consider
shifting the $n$-tuple $x = (x_{n-1}, \ldots, x_0)$ cyclically left by $k$ places on a $d$-dimensional hypercube.
If $k \le n/2$ (see Fig. 7.16(a)), the largest element in the right half of $x$, namely $x_{n/2-1}$,
moves to the left half of $x$. On the other hand, if $k > n/2$ (see Fig. 7.16(b)), $x_{n/2-1}$ moves
to the right half of $x$.

Thus, to shift $x$ left cyclically by $k$ places, $k \le n/2$, divide $x$ into two $(n/2)$-tuples,
shift each of these tuples cyclically by $k$ places, and then swap the rightmost $k$ components
of the two halves, as suggested in Fig. 7.16(a). The swap is done via edges across the highest
dimension of the hypercube. When $k > n/2$, cyclically shift each $(n/2)$-tuple by $k - n/2$
positions and then swap the high-order $n - k$ positions from each tuple across the highest
dimension of the hypercube. We have the following result.

Figure 7.16 The two cases of a normal algorithm for cyclic shifting on a hypercube.

THEOREM 7.7.3 Cyclic shifting of an $n$-tuple, $n = 2^d$, by any amount can be done recursively by
a normal algorithm in $\log_2 n$ communication steps.
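
The recursion behind this theorem can be sketched as follows. In this illustration (the function name `cyclic_shift` and the list layout are assumptions) list index i holds the component $x_i$, so a left cyclic shift by $k$ moves the item at index i to index $(i + k) \bmod n$, and "rightmost" components are the low indices of each half. Each level of the recursion performs one set of swaps between positions that differ only in the highest address bit of the current subcube.

```python
def cyclic_shift(x, k):
    """Left cyclic shift by k of a list whose length is a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    k %= n
    half = n // 2
    # Shift each half within itself; a shift by half is the identity on a half.
    lo = cyclic_shift(x[:half], k % half)
    hi = cyclic_shift(x[half:], k % half)
    if k <= half:
        swap = range(0, k)             # rightmost k components of each half
    else:
        swap = range(k - half, half)   # high-order n - k components of each half
    for i in swap:                     # exchanges across the highest dimension
        lo[i], hi[i] = hi[i], lo[i]
    return lo + hi
```

The recursion has depth $\log_2 n$, with one exchange step per level, matching the step count in the theorem.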

7.7.4  Shuffle  and  Unshuffle  Permutations  on  Linear  Arrays 

Because  many  important  algorithms  are  normal  and  hypercubes  are  expensive  to  realize,  it 
is  preferable  to  realize  normal  algorithms  on  arrays.  In  this  section  we  introduce  the  shuffle 
and  unshuffle  permutations,  show  that  they  can  be  used  to  realize  normal  algorithms,  and  then 
show  that  they  can  be  realized  on  linear  arrays.  We  use  the  unshuffle  algorithms  to  map  normal 
hypercube  algorithms  onto  one-  and  two-dimensional  meshes. 

Let $\mathbb{N}(n) = \{0, 1, 2, \ldots, n - 1\}$ and $n = 2^d$. The shuffle permutation
$\pi^{(n)}_{\mathrm{shuffle}} : \mathbb{N}(n) \mapsto \mathbb{N}(n)$ moves the item in position $a$ to position
$\pi^{(n)}_{\mathrm{shuffle}}(a)$, where $\pi^{(n)}_{\mathrm{shuffle}}(a)$ is the
integer represented by the left cyclic shift of the $d$-bit binary number representing $a$. For example,
when $n = 8$ the integer 3 is represented by the binary number 011 and its left cyclic shift
is 110. Thus, $\pi^{(8)}_{\mathrm{shuffle}}(3) = 6$. The shuffle permutation of the sequence {0, 1, 2, 3, 4, 5, 6, 7}
is the sequence {0, 4, 1, 5, 2, 6, 3, 7}. A shuffle operation is analogous to interleaving of the
two halves of a sorted deck of cards. Figure 7.17 shows this mapping for $n = 8$.

The unshuffle permutation $\pi^{(n)}_{\mathrm{unshuffle}} : \mathbb{N}(n) \mapsto \mathbb{N}(n)$ reverses the shuffle operation: it
moves the item in position $b$ to position $a$ where $b = \pi^{(n)}_{\mathrm{shuffle}}(a)$; that is,
$a = \pi^{(n)}_{\mathrm{unshuffle}}(b) = \pi^{(n)}_{\mathrm{unshuffle}}(\pi^{(n)}_{\mathrm{shuffle}}(a))$. Figure 7.18 shows this mapping for $n = 8$. The shuffle permutation
is obtained by reversing the directions of edges in this graph.

An unshuffle operation can be performed on an $n$-cell linear array, $n = 2^d$, by assuming
that the cells contain the integers $\{0, 1, 2, \ldots, n - 1\}$ from left to right represented as $d$-bit
binary integers and then sorting them by their least significant bit using a stable sorting
algorithm. (A stable sorting algorithm is one that does not change the original order of keys
with the same value.) When this is done, the sequence {0, 1, 2, 3, 4, 5, 6, 7} is mapped to the
sequence {0, 2, 4, 6, 1, 3, 5, 7}, the unshuffled sequence, as shown in Fig. 7.18. The integer
$b$ is mapped to the integer $a$ whose binary representation is that of $b$ shifted cyclically right
by one position. For example, position 1 (001) is mapped to position 4 (100) and position 6
(110) is mapped to position 3 (011).

Figure 7.17 The shuffle permutation can be realized by a series of swaps of the contents of cells.
The cells between which swaps are done have a heavy bar above them. The result of swapping cells
of one row is shown in the next higher row, so that the top row contains the result of shuffling the
bottom row.

Since bubble sort is a stable sorting algorithm, we use it to realize the unshuffle permutation.
(See Section 7.5.2.) In each phase keys (binary tuples) are compared based on their least
significant bits. In the first phase values in positions $i$ and $i + 1$ are compared for $i$ even. The
next comparison is between such pairs for $i$ odd. Comparisons of this form continue, alternating
between even and odd values for $i$, until the sequence is sorted. Since the first phase has
no effect on the integers $\{0, 1, 2, \ldots, n - 1\}$, it is not done. Subsequent phases are shown in
Fig. 7.18. Pairs that are compared are connected by a light line; a darker line joins pairs whose
values are swapped. (See Problem 7.16.)
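
The phases just described can be simulated directly. The sketch below is illustrative only: the names `rotate_right` and `unshuffle_by_transposition` are assumptions, and the loop simply alternates odd and even phases, starting with an odd one since the first even phase would change nothing, until the cells are ordered by least significant bit.

```python
def rotate_right(b, d):
    """Cyclic right shift by one position of the d-bit number b."""
    return (b >> 1) | ((b & 1) << (d - 1))

def unshuffle_by_transposition(n):
    """Sort 0..n-1 by least significant bit with alternating odd/even compare-exchange phases.
    Returns (final cell contents, number of phases used)."""
    cells = list(range(n))
    phases = 0
    start = 1                                   # skip the ineffective first even phase
    while any((cells[i] & 1) > (cells[i + 1] & 1) for i in range(n - 1)):
        for i in range(start, n - 1, 2):
            if (cells[i] & 1) > (cells[i + 1] & 1):   # strict test keeps the sort stable
                cells[i], cells[i + 1] = cells[i + 1], cells[i]
        phases += 1
        start ^= 1                              # alternate between odd and even i
    return cells, phases
```

For $n = 8$ this yields the sequence {0, 2, 4, 6, 1, 3, 5, 7} given above, with each integer $b$ ending in position `rotate_right(b, 3)`.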

We now show how to implement efficiently a fully normal ascending algorithm on a linear
array. (See Fig. 7.19.) Let the exchange locations of the linear array be locations $i$ and $i + 1$
of the array for $i$ even. Only elements in exchange locations are swapped. Exchanges across
the first dimension of the hypercube are done by swaps across exchange locations. To simulate
exchanges across the second dimension, perform a shuffle operation (by reversing the order of
the operations of Fig. 7.18) on each group of four elements. This places into exchange locations
elements whose original indices differed by two. Performing a shuffle on eight, sixteen, etc.
positions places into exchange locations elements whose original indices differed by four, eight,
etc. The proof of correctness of this result is left to the reader. (See Problem 7.17.)

Figure 7.18 An unshuffle operation is obtained by bubble sorting the integers $\{0, 1, 2, \ldots, n - 1\}$
based on the value of their least significant bits. The cells with bars over them are compared.
The first set of comparisons is done on elements in the bottom row. Those pairs with light bars
contain integers whose values are in order.

Figure 7.19 A normal ascending algorithm realized by shuffle operations on $2^k$ elements,
$k = 2, 3, 4, \ldots$, places into exchange locations elements whose indices differ by increasing powers
of two. Exchange locations are paired together.
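
A small simulation makes the effect of the successive shuffles visible. In this sketch (the names `shuffle` and `exchange_pairs_per_stage` are assumptions) a shuffle is applied as a single interleaving of the two halves of a block rather than as the sequence of neighbor swaps used on a real linear array, so it shows which elements meet at exchange locations but not the number of array steps.

```python
def shuffle(block):
    """Perfect shuffle: interleave the two halves of a block of even length."""
    half = len(block) // 2
    out = []
    for lo, hi in zip(block[:half], block[half:]):
        out += [lo, hi]
    return out

def exchange_pairs_per_stage(d):
    """For each hypercube dimension j = 0..d-1, the pairs of original indices that
    occupy exchange locations (i, i+1), i even, after shuffling blocks of 2^(j+1)."""
    n = 1 << d
    cells = list(range(n))
    stages = []
    for j in range(d):
        if j > 0:
            size = 1 << (j + 1)
            cells = [x for s in range(0, n, size) for x in shuffle(cells[s:s + size])]
        stages.append([(cells[i], cells[i + 1]) for i in range(0, n, 2)])
    return stages
```

For $d = 3$ the three stages pair indices that differ by 1, 2, and 4, as the text claims.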

Since a shuffle on $n = 2^d$ elements can be done in $2^{d-1} - 1$ steps on a linear array
with $n$ cells (see Theorem 7.5.2), it follows that this fully normal ascending algorithm uses
$T(n) = \phi(d)$ steps, where $T(2) = \phi(1) = 0$ and

$\phi(d) = \phi(d - 1) + 2^{d-1} - 1 = 2^d - d - 1$

Do a fully normal descending algorithm by a shuffle followed by its steps in reverse order.

THEOREM 7.7.4 A fully normal ascending (descending) algorithm that runs in $d = \log_2 n$ steps
on a $d$-dimensional hypercube containing $2^d$ vertices can be realized on a linear array of $n = 2^d$
elements with $T(n) = n - \log_2 n - 1$ ($2T(n)$) additional parallel steps.
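
The closed form in the theorem can be checked by unrolling the recurrence for $\phi(d)$ from $\phi(1) = 0$. The short derivation below is added as a check and uses only the quantities already defined:

$$\phi(d) = \sum_{j=2}^{d} \left(2^{j-1} - 1\right) = \left(2^{d} - 2\right) - (d - 1) = 2^{d} - d - 1 = n - \log_2 n - 1 .$$

For example, $d = 3$ gives $\phi(3) = 1 + 3 = 4 = 8 - 3 - 1$.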

From the discussion of Section 7.7 it follows that broadcasting, associative combining,
and the FFT algorithm can be executed on a linear array in $O(n)$ steps because each can be
implemented as a normal algorithm on the $n$-vertex hypercube. Also, a list of $n$ items can
be sorted on a linear array in $O(n)$ steps by translating Batcher’s sorting algorithm based on
bitonic merging, a normal sorting algorithm, to the linear array. (See Problem 7.20.)

7.7.5  Fully  Normal  Algorithms  on  Two-Dimensional  Arrays 

We now consider the execution of a normal algorithm on a rectangular array. We assume
that the $n = 2^{2d}$ vertices of a $2d$-dimensional hypercube are mapped onto an $m \times m$ mesh,
$m = 2^d$, in row-major order. Since each cell is indexed by a pair consisting of row and column
indices, $(r, c)$, and each of these satisfies $0 \le r \le m - 1$ and $0 \le c \le m - 1$, they can each be
represented by a $d$-bit binary number. Let $\mathbf{r}$ and $\mathbf{c}$ be these binary numbers. Thus cell $(r, c)$
is indexed by the $2d$-bit binary number $\mathbf{rc}$.

Cells in positions $(r, c)$ and $(r, c + 1)$ have associated binary numbers that agree in their
$d$ most significant positions. Cells in positions $(r, c)$ and $(r + 1, c)$ have associated binary
numbers that agree in their $d$ least significant positions. To simulate a normal hypercube
algorithm on the 2D mesh, in each row simulate a normal hypercube algorithm on $2^d$ vertices after
which in each column simulate a normal hypercube algorithm on $2^d$ vertices. The correctness
of this procedure follows because every adjacent pair of vertices of the simulated hypercube is
at some time located in adjacent cells of the 2D array.
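
The row-major mapping can be made concrete with two small helpers. The names `row_major_index` and `mesh_partner` are assumptions for this sketch; dimensions are numbered from 0 at the least significant bit of the $2d$-bit index.

```python
def row_major_index(r, c, d):
    """2d-bit index of cell (r, c) on a 2^d x 2^d mesh: row bits high, column bits low."""
    return (r << d) | c

def mesh_partner(r, c, j, d):
    """Cell holding the hypercube neighbor of cell (r, c) across dimension j."""
    if j < d:
        return r, c ^ (1 << j)          # low d dimensions: partner lies in the same row
    return r ^ (1 << (j - d)), c        # high d dimensions: partner lies in the same column
```

Exchanges across the low $d$ dimensions therefore stay within rows and those across the high $d$ dimensions stay within columns, which is what allows the linear-array simulation to be applied first to every row and then to every column.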

From Theorem 7.7.4 it follows that hypercube exchanges across the lower $d$ dimensions
can be simulated in time proportional to the length of a row, that is, in time $O(\sqrt{n})$. Similarly,
it also follows that hypercube exchanges across the higher $d$ dimensions can be simulated in
time $O(\sqrt{n})$. We summarize this result below.

THEOREM 7.7.5 A fully normal $2d$-dimensional hypercube algorithm (ascending or descending),
$n = 2^{2d}$, can be realized in $O(\sqrt{n})$ steps on a $\sqrt{n} \times \sqrt{n}$ array of cells.

It follows from the discussion of Section 7.7 that broadcasting, associative combining,
and the FFT algorithm can be executed on a 2D mesh in $O(\sqrt{n})$ steps because each can be
implemented as a normal algorithm on the $n$-vertex hypercube.

Also, a list of $n$ items can be sorted on a $\sqrt{n} \times \sqrt{n}$ array in $O(\sqrt{n})$ steps by translating
a normal merging algorithm to the $\sqrt{n} \times \sqrt{n}$ array and using it recursively to create a sorting
network. (See Problem 7.21.) No sorting algorithm can sort in fewer than $2\sqrt{n} - 2$ steps on
a $\sqrt{n} \times \sqrt{n}$ array because whatever element is positioned in the lower right-hand corner of
the array could originate in the upper left-hand corner and have to traverse at least $2\sqrt{n} - 2$
edges to arrive there.

7.7.6  Normal  Algorithms  on  Cube-Connected  Cycles 

Consider now processors connected as a $d$-dimensional cube-connected cycle (CCC) network
in which each ring has $r = 2^k \ge d$ processors. In particular, let $r$ be the smallest power of 2
greater than or equal to $d$, so that $d \le r < 2d$. (Thus $k = O(\log d)$.) We call such a CCC
network a canonical CCC network on $n$ vertices. It has $n = r2^d$ vertices, $d2^d \le n < (2d)2^d$.
(Thus $d = O(\log n)$.) We show that a fully normal algorithm can be executed efficiently on
such CCC networks.
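
The parameters of a canonical CCC network are easy to tabulate. The helper below is a minimal sketch; its name `canonical_ccc` and its return convention are assumptions made here.

```python
def canonical_ccc(d):
    """Return (r, k, n) for the canonical CCC on a d-dimensional hypercube:
    r = 2^k is the smallest power of two with r >= d, and n = r * 2^d."""
    r = 1
    while r < d:
        r *= 2
    k = r.bit_length() - 1
    n = r * (1 << d)
    return r, k, n

for d in (3, 4, 5, 8):
    print(d, canonical_ccc(d))
```

For every $d \ge 1$ the returned values satisfy $d \le r < 2d$ and $d2^d \le n < (2d)2^d$.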

Let each ring of the CCC network be indexed by a $d$-tuple corresponding to the corner
of the hypercube at which it resides. Let each processor be indexed by a $(d + k)$-tuple in
which the $d$ low-order bits are the ring index and the $k$ high-order bits specify the position of
a processor on the ring.

A fully normal algorithm on a canonical CCC network is implemented in two phases. In
the first phase, the ring is treated as an array and a fully normal algorithm on the $k$ high-order
bits is simulated in $O(d)$ steps. In the second phase, exchanges are made across hypercube
edges. Rotate the elements on each ring so that ring processors whose $k$-bit indices are 0 (call
these the lead elements) are adjacent along the first dimension of the original hypercube.
Exchange information between them. Now rotate the rings by one position so that lead elements
are adjacent along the second dimension of the original hypercube. The elements immediately
behind the lead elements on the rings are now adjacent along the first hypercube dimension
and are exchanged in parallel with the lead elements. (This simultaneous execution is called
pipelining.) Subsequent rotations of the rings place successive ring elements in alignment
along increasing bit positions. After $O(d)$ rotations all exchanges are complete. Thus, a total
of $O(d)$ time steps suffice to execute a fully normal algorithm. We have the following result.

THEOREM 7.7.6 A fully normal algorithm (ascending or descending) for an $n$-vertex hypercube
can be realized in $O(\log n)$ steps on a canonical $n$-vertex cube-connected cycle network.

Thus,  a  fully  normal  algorithm  on  an  n-vertex  hypercube  can  be  simulated  on  a  CCC 
network  in  time  proportional  to  the  time  on  the  hypercube.  However,  the  vertices  of  the  CCC 
have  bounded  degree,  which  makes  them  much  easier  to  realize  in  hardware  than  high-degree 
networks. 


7.7.7  Fast  Matrix  Multiplication  on  the  Hypercube 

Matrix multiplication can be done more quickly on the hypercube than on a two-dimensional
array. Instead of $O(n)$ steps, only $O(\log n)$ steps are needed, as we show.

Consider the multiplication of $n \times n$ matrices $A$ and $B$ for $n = 2^r$ to produce the product
matrix $C = A \times B$. We describe a normal systolic algorithm to multiply these matrices on a
$d$-dimensional hypercube, $d = 3r$.

Since $d = 3r$, the vertices of the $d$-dimensional hypercube are addressed by a binary $3r$-tuple,
$a = (a_{3r-1}, a_{3r-2}, \ldots, a_0)$. Let the $r$ least significant bits of $a$ denote an integer $i$, let
the next $r$ lsb's denote an integer $j$, and let the $r$ most significant bits denote an integer $k$.
Then, we have $|a| = kn^2 + jn + i$ since $n = 2^r$. Because of this identity, we represent the
address $a$ by the triple $(i, j, k)$. We speak of the processor $P_{i,j,k}$ located at the vertex $(i, j, k)$
of the $d$-dimensional hypercube, $d = 3r$. We denote by $HC_{i,j,-}$ the subhypercube in which $i$
and $j$ are fixed and by $HC_{i,-,k}$ and $HC_{-,j,k}$ the subhypercubes in which the two other pairs
of indices are fixed. There are $2^{2r}$ subhypercubes of each kind.

We assume that each processor $P_{i,j,k}$ contains three local variables, $A_{i,j,k}$, $B_{i,j,k}$, and
$C_{i,j,k}$. We also assume that initially $A_{i,j,0} = a_{i,j}$ and $B_{i,j,0} = b_{i,j}$, where $0 \le i, j \le n - 1$.
The multiplication algorithm has the following five phases:

a) For each subhypercube $HC_{i,j,-}$ and for $1 \le k \le n - 1$, broadcast $A_{i,j,0}$ (containing $a_{i,j}$)
to $A_{i,j,k}$ and $B_{i,j,0}$ (containing $b_{i,j}$) to $B_{i,j,k}$.

b) For each subhypercube $HC_{i,-,k}$ and for $0 \le j \le n - 1$, $j \ne k$, broadcast $A_{i,k,k}$
(containing $a_{i,k}$) to $A_{i,j,k}$.

c) For each subhypercube $HC_{-,j,k}$ and for $0 \le i \le n - 1$, $i \ne k$, broadcast $B_{k,j,k}$
(containing $b_{k,j}$) to $B_{i,j,k}$.

d) At each processor $P_{i,j,k}$ compute $C_{i,j,k} = A_{i,j,k} \cdot B_{i,j,k} = a_{i,k}b_{k,j}$.

e) At processor $P_{i,j,0}$ compute the sum $C_{i,j,0} = \sum_k C_{i,j,k}$ ($C_{i,j,0}$ now contains
$c_{i,j} = \sum_k a_{i,k}b_{k,j}$).

From Theorem 7.7.2 it follows that each of these five steps can be done in $O(r)$ steps,
where $r = \log_2 n$. We summarize this result below.
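
The data movement in the five phases can be traced with a sequential simulation. The sketch below is an illustration under stated assumptions: the function name `hypercube_matmul` is introduced here, the local variables are held in nested lists indexed `[i][j][k]`, and each broadcast and the final sum are performed by plain loops rather than by the $O(r)$-step subhypercube algorithms of Theorem 7.7.2. It therefore shows what each processor holds after every phase, not the parallel running time.

```python
def hypercube_matmul(a, b):
    """Multiply n x n matrices by tracing phases a) to e); A[i][j][k] models A_{i,j,k}."""
    n = len(a)
    rng = range(n)
    A = [[[None] * n for _ in rng] for _ in rng]
    B = [[[None] * n for _ in rng] for _ in rng]
    for i in rng:
        for j in rng:
            A[i][j][0], B[i][j][0] = a[i][j], b[i][j]      # initial placement
    for i in rng:                                          # phase a): along k
        for j in rng:
            for k in range(1, n):
                A[i][j][k], B[i][j][k] = A[i][j][0], B[i][j][0]
    for i in rng:                                          # phase b): A_{i,k,k} to A_{i,j,k}
        for k in rng:
            for j in rng:
                if j != k:
                    A[i][j][k] = A[i][k][k]
    for j in rng:                                          # phase c): B_{k,j,k} to B_{i,j,k}
        for k in rng:
            for i in rng:
                if i != k:
                    B[i][j][k] = B[k][j][k]
    # phases d) and e): local products, then the sum over k collected at k = 0
    return [[sum(A[i][j][k] * B[i][j][k] for k in rng) for j in rng] for i in rng]
```

After phases b) and c), processor $P_{i,j,k}$ holds $a_{i,k}$ and $b_{k,j}$, which is exactly the pair needed for the $k$th term of $c_{i,j}$.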

THEOREM 7.7.7 Two $n \times n$ matrices, $n = 2^r$, can be multiplied by a normal systolic algorithm on
a $d$-dimensional hypercube, $d = 3r$, with $n^3$ processors in $O(\log n)$ steps. All normal algorithms
for $n \times n$ matrix multiplication require $\Omega(\log n)$ steps.

Proof The upper bound follows from the construction. The lower bound follows from the
observation that each processor that is participating in the execution of a normal algorithm
combines two values, one that it owns and one owned by one of its neighbors. Thus, if
$t$ steps are executed to compute a value, that value cannot depend on more than $2^t$ other
values. Since each entry in an $n \times n$ product matrix is a function of $2n$ other values, $t$ must
be at least $\log_2(2n)$. ■

The lower bound stated above applies only to normal algorithms. If a non-normal algorithm
is used, each processor can combine up to $d$ values. Thus, after $k$ steps, up to $d^k$ values
can be combined. If $2n$ values must be combined, as in $n \times n$ matrix multiplication, then
$k \ge \log_d(2n) = (\log_2 2n)/\log_2 d$. If an $n^3$-processor hypercube is used for this problem,
$d = 3\log_2 n$ and $k = \Omega(\log n/\log\log n)$.

The normal matrix multiplication algorithm described above can be translated to linear
arrays and 2D meshes using the mappings based on the shuffle and unshuffle operations. The
2D mesh version has a running time $O(\sqrt{n}\log n)$, which is inferior to the running time of
the algorithm given in Section 7.5.3.


7.8  Routing  in  Networks 

A  topic  of  major  concern  in  the  design  of  distributed  memory  machines  is  routing,  the  task  of 
transmitting  messages  among  processors  via  nodes  of  a  network.  Routing  becomes  challenging 
when  many  messages  must  travel  simultaneously  through  a  network  because  they  can  produce 
congestion  at  nodes  and  cause  delays  in  the  receipt  of  messages. 

Some  routing  networks  are  designed  primarily  for  the  permutation-routing  problem,  the 
problem  of  establishing  a  one-to-one  correspondence  between  n  senders  and  n  receivers.  (A 
processor  can  be  both  a  sender  and  receiver.)  Each  sender  sends  one  message  to  a  unique 
receiver and each receiver receives one message from a unique sender. (We examine in Section 7.9.3
routing methods when the numbers of senders and receivers differ and more than
one message can be received by one processor.) If many messages are targeted at one receiver,
a  long  delay  will  be  experienced  at  this  receiver.  It  should  be  noted  that  network  congestion 
can  occur  at  a  node  even  when  messages  are  uniformly  distributed  throughout  the  network, 
because  many  messages  may  have  to  pass  through  this  node  to  reach  their  destinations. 

7.8.1 Local Routing Networks

In  a  local  routing  network  each  message  is  accompanied  by  its  destination  address.  At  each 
network  node  (switch)  the  routing  algorithm,  using  only  these  addresses  and  not  knowing  the 
global  state  of  the  network,  finds  a  path  for  messages. 

A sorting network, suitably modified to transmit messages, is a local permutation-routing
network. Batcher’s bitonic sorting network described in Section 6.8.1 will serve as such a
network. As mentioned in Section 7.7, this network can be realized as a normal algorithm on
a hypercube, with running time on an $n$-vertex hypercube $O(\log^2 n)$. (See Problem 6.28.)
On the two-dimensional mesh its running time is $O(\sqrt{n})$ (see Problem 7.21), whereas on the
linear array it is $O(n)$ (see Problem 7.20).

Batcher’s bitonic sorting network is data-oblivious; that is, it performs the same set of
operations for all values of the input data. The outcomes of these operations are data-dependent,
but the operations themselves are data-independent. Non-oblivious sorting algorithms
perform operations that depend on the values of the input data. An example of a local
non-oblivious algorithm is one that sends a message from the current network node to the
neighboring node that is closest to the destination.
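
As one concrete instance of such a local rule, consider a single message on a hypercube, where a neighbor is closer to the destination exactly when it corrects one of the address bits in which the current node and the destination differ. The sketch below assumes a hypercube network and breaks ties by correcting the lowest differing bit; both choices, and the name `greedy_route`, are assumptions made for illustration and are not prescribed by the text. Congestion among many simultaneous messages is not modeled.

```python
def greedy_route(src, dst):
    """Path of one message on a hypercube, moving at each node to a neighbor
    that is one step closer to dst (lowest differing address bit first)."""
    path = [src]
    cur = src
    while cur != dst:
        diff = cur ^ dst
        lowest = diff & -diff      # isolate the lowest bit in which cur and dst differ
        cur ^= lowest              # cross the hypercube edge for that dimension
        path.append(cur)
    return path

print(greedy_route(0b000, 0b110))
```

The decision at each node uses only the destination address carried by the message, which is what makes the rule local.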

7.8.2  Global  Routing  Networks 

In  a  global  routing  network,  knowledge  of  the  destinations  of  all  messages  is  used  to  set  the 
network  switches  and  select  paths  for  the  messages  to  follow.  A  global  permutation-routing 
network  realizes  permutations  of  the  destination  addresses.  We  now  give  an  example  of  such  a 
network,  the  Benes  permutation  network. 

A  permutation  network  is  constructed  of  two-input,  two-output  switches.  Such  a  switch 
either  passes  its  inputs,  labeled  A  and  B,  to  its  outputs,  labeled  X  and  Y,  or  it  swaps  them.  That 
is,  the  switch  is  set  so  that  either  X  =  A  and  Y  =  B  or  X  =  B  and  Y  =  A.  A  permutation 
network  on  n  inputs  and  n  outputs  is  a  directed  acyclic  graph  of  these  switches  such  that  for 
each  permutation  of  the  n  inputs,  switches  can  be  set  to  create  n  disjoint  paths  from  the  n 
inputs  to  the  n  outputs. 

A Benes permutation network is shown in Fig. 7.20. This graph is produced by connecting
two copies of an FFT graph on $2^{k-1}$ inputs back to back and replacing the nodes
by switches and edges by pairs of edges. (FFT graphs are described in Section 6.7.3.) It follows
that a Benes permutation network on $n$ inputs can be realized by a normal algorithm
that executes $O(\log n)$ steps. Thus, a permutation is computed much more quickly (in time
$O(\log n)$) with the Benes offline permutation network than it can be done on Batcher’s online
bitonic sorting network (in time $O(\log^2 n)$). However, the Benes network requires time to
collect the destinations at some central location, compute the switch settings, and transmit
them to the switches themselves.

Figure 7.20 A Benes permutation network.

To understand how the Benes network works, we provide an alternative characterization
of it. Let $P_n$ be the Benes network on $n$ inputs, $n = 2^k$, defined as back-to-back FFT graphs
with nodes replaced by switches. Then $P_n$ may be defined recursively, as suggested in Fig. 7.20.
$P_n$ is obtained by making two copies of $P_{n/2}$, placing $n/2$ copies of a two-input, two-output
switch at the input and the same number at the output. For $1 \le i \le n/4$ ($n/4 + 1 \le i \le n/2$)
the top output of switch $i$ is connected to the top input of the $i$th switch in the upper (lower)
copy of $P_{n/2}$ and the bottom output is connected to the bottom input of the $i$th switch in the
lower (upper) copy of $P_{n/2}$. The connections of output switches are the mirror image of the
connections of the input switches.

Consider the Benes network $P_2$. It consists of a single switch and generates the two possible
permutations of the inputs. We show by induction that $P_n$ generates all n! permutations of its
n inputs. Assume that this property holds for $n = 2, 4, \ldots, 2^{k-1}$. We show that it holds for
$m = 2^k$. Let $\pi = (\pi(1), \pi(2), \ldots, \pi(m))$ be an arbitrary permutation to be realized by $P_m$.
This means that the ith input must be connected to the $\pi(i)$th output. Suppose that $\pi(3)$ is
2, as shown in Fig. 7.20. We can arbitrarily choose to have the third input pass through the
first or second copy of $P_{m/2}$. We choose the second. The path taken through the second copy
of $P_{m/2}$ must emerge on its second output so that it can then pass to the first switch in the
column of output switches. This output switch must pass its inputs without swapping them.
The other output of this switch, namely 1, must arrive via a path through the first copy of
$P_{m/2}$ and emerge on its first output. To determine the input at which it must arrive, we find
the input of $P_m$ associated with the output of 1 and set its switch so that it is directed to the
first copy of $P_{m/2}$. Since the other input to this input switch must go to the other copy of
$P_{m/2}$, we follow its path through $P_m$ to the output and then reason in the same way about the
other output at the output switch at which it arrives. If by tracing paths back and forth this
way we do not exhaust all inputs and outputs, we pick another input and repeat the process
until all inputs have been routed to outputs.
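
The back-and-forth tracing argument can be sketched in Python. This is an illustrative sketch, not the book's construction: the names `split`, `benes_route`, and `run_network` are hypothetical, inputs and outputs are numbered from 0, and the wiring is assumed to send the top output of input switch s to port s of the upper copy and the bottom output to port s of the lower copy (symmetrically at the outputs). The port numbering in Fig. 7.20 may differ, which should change the indices but not the tracing idea.

```python
def split(pi):
    """Assign every input of pi to sub-network 0 (upper) or 1 (lower).

    Inputs 2s and 2s+1 share an input switch and must use different copies;
    outputs 2t and 2t+1 share an output switch and must arrive from different copies.
    """
    m = len(pi)
    inv = [0] * m
    for i, out in enumerate(pi):
        inv[out] = i
    side = [None] * m
    for start in range(m):
        i = start
        while side[i] is None:
            side[i] = 0            # free choice for an input not yet constrained
            j = inv[pi[i] ^ 1]     # input bound for the other port of the same output switch
            side[j] = 1            # it must travel through the other copy
            i = j ^ 1              # j's input-switch partner is forced back into copy 0
    return side


def benes_route(pi):
    """Return nested switch settings realizing pi (input i goes to output pi[i])."""
    m = len(pi)
    if m == 2:
        return pi[0] == 1          # a single switch; True means "swap"
    side = split(pi)
    half = m // 2
    in_swap = [side[2 * s] == 1 for s in range(half)]
    out_swap = [False] * half
    sub = [[0] * half, [0] * half]  # permutations the two copies must realize
    for i in range(m):
        sub[side[i]][i // 2] = pi[i] // 2
        if side[i] == 0:           # the upper copy feeds the top port of output switch pi[i] // 2
            out_swap[pi[i] // 2] = pi[i] % 2 == 1
    return in_swap, benes_route(sub[0]), benes_route(sub[1]), out_swap


def run_network(settings, values):
    """Push values through the configured network and return what the outputs see."""
    m = len(values)
    if m == 2:
        return values[::-1] if settings else list(values)
    in_swap, upper_cfg, lower_cfg, out_swap = settings
    upper, lower = [], []
    for s in range(m // 2):
        a, b = values[2 * s], values[2 * s + 1]
        if in_swap[s]:
            a, b = b, a
        upper.append(a)
        lower.append(b)
    upper = run_network(upper_cfg, upper)
    lower = run_network(lower_cfg, lower)
    out = []
    for t in range(m // 2):
        a, b = upper[t], lower[t]
        if out_swap[t]:
            a, b = b, a
        out += [a, b]
    return out
```

The line `side[i] = 0` plays the role of the arbitrary choice of copy in the argument above; every later assignment in the same loop is forced. Calling `run_network(benes_route(pi), list(range(len(pi))))` returns a list `out` with `out[pi[i]] == i`.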

Now let’s determine the number of switches, $S(k)$, in a Benes network $P_n$ on $n = 2^k$
inputs. It follows that $S(1) = 1$ and

$$S(k) = 2S(k-1) + 2^k$$

It is straightforward to show that $S(k) = (k - \tfrac{1}{2})2^k = n(\log_2 n - \tfrac{1}{2})$.
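
As a quick numerical check of the switch count, the recurrence can be evaluated directly and compared with the closed form. The function names below are hypothetical and chosen only for this sketch.

```python
def benes_switch_count(k: int) -> int:
    """Switches in the Benes network on n = 2**k inputs, from the recurrence."""
    if k == 1:
        return 1                       # P_2 is a single switch
    # two copies of P_{n/2}, plus n/2 input switches and n/2 output switches
    return 2 * benes_switch_count(k - 1) + 2 ** k


def closed_form(k: int) -> int:
    """(k - 1/2) * 2**k, written with integer arithmetic only."""
    return (2 * k - 1) * 2 ** (k - 1)
```

For instance, the recurrence gives 1, 6, and 20 switches for k = 1, 2, and 3, and the closed form agrees.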

Although a global permutation network sends messages to their destinations more quickly
than a local permutation network, the switch settings must be computed and distributed globally,
both of which impose important limitations on the time to realize particular permutations.

7.9 The PRAM Model

The parallel random-access machine (PRAM) (see Fig. 7.21), the canonical structured parallel
machine, consists of a bounded set of processors and a common memory containing a
potentially unlimited number of words. Each processor is similar to the random-access machine
(RAM) described in Section 3.4 except that its CPU can access locations in both its local
random-access memory and the common memory. During each PRAM step, the RAMs execute
the following steps in synchrony: they (a) read from the common memory, (b) perform
a local computation, and (c) write to the common memory. Each RAM has its own program
and program counter as well as a unique identifying number $id_j$ that it can access to make
processor-dependent decisions. The PRAM is primarily an abstract programming model, not
a machine designed to be built (unlike mesh-based computers, for example).

Figure 7.21 The PRAM consists of synchronous RAMs accessing a common memory.

The power of the PRAM has been explored by considering a variety of assumptions about
the length of local computations and the type of instruction allowed. In designing parallel
algorithms it is generally assumed that each local computation consists of a small number of
instructions. However, when this restriction is dropped and the PRAM is allowed an unlimited
number of computations between successive accesses to the common memory (the ideal
PRAM), the information transmitted between processors reflects the minimal amount of
information that must be exchanged to solve a problem on a parallel computer.

Because  the  size  of  memory  words  is  potentially  unbounded,  very  large  numbers  can  be 
generated  very  quickly  on  a  PRAM  if  a  RAM  can  multiply  and  divide  integers  and  perform 
vector  operations.  This  allows  each  RAM  to  emulate  a  parallel  machine  with  an  unbounded 
number  of  processors.  Since  the  goal  is  to  understand  the  power  of  parallelism,  however,  this 
form  of  hidden  parallelism  is  usually  disallowed,  either  by  not  permitting  these  instructions  or 
by  assuming  that  in  t  steps  a  PRAM  generates  numbers  whose  size  is  bounded  by  a  polynomial 
in  t.  To  simplify  the  discussion,  we  limit  instructions  in  a  RAM’s  repertoire  to  addition, 
subtraction,  vector  comparison  operations,  conditional  branching,  and  shifts  by  fixed  amounts. 
We  also  allow  load  and  store  instructions  for  moving  words  between  registers,  local  memories, 
and  the  common  memory.  These  instructions  are  sufficiently  rich  to  compute  all  computable 
functions. 

As yet we have not specified the conditions under which access to the common memory occurs
in the first and third substeps of each PRAM step. If access by more than one RAM to the
same location is disallowed, access is exclusive. If this restriction does not apply, access is
concurrent. Four combinations of these classifications apply to reading and writing. The strongest
restriction is placed on the Exclusive Read/Exclusive Write (EREW) PRAM, with successively
weaker restrictions placed on the Concurrent Read/Exclusive Write (CREW) PRAM,
the Exclusive Read/Concurrent Write (ERCW) PRAM, and the Concurrent Read/Concurrent
Write (CRCW) PRAM. When concurrent writing is allowed, conflicts are resolved
in one of the following ways: a) the COMMON model requires that all RAMs writing to a
common location write the same value, b) the ARBITRARY model allows an arbitrary one of
the competing values to be written, and c) the PRIORITY model writes into the common
location the value being written by the lowest numbered RAM.

Observe that any algorithm written for the COMMON CRCW PRAM runs without
change on the ARBITRARY CRCW PRAM. Similarly, an ARBITRARY CRCW PRAM algorithm
runs without change on the PRIORITY CRCW PRAM. Thus, the latter is the most
powerful of the PRAM models.
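
The three write-conflict rules can be illustrated with a small simulation of a single write substep. This is a sketch under assumptions: the function name `resolve_writes` and the `(processor_id, address, value)` request format are invented for illustration, and `random.choice` merely stands in for the unspecified choice made by the ARBITRARY model.

```python
import random


def resolve_writes(requests, model):
    """Apply one concurrent-write substep.

    requests: list of (processor_id, address, value) tuples issued in the same step.
    Returns a dict mapping each written address to the value that ends up stored.
    """
    by_addr = {}
    for pid, addr, val in requests:
        by_addr.setdefault(addr, []).append((pid, val))
    memory = {}
    for addr, writers in by_addr.items():
        if model == "COMMON":
            values = {val for _, val in writers}
            if len(values) != 1:       # COMMON forbids disagreeing writers
                raise ValueError(f"conflicting values written to cell {addr}")
            memory[addr] = values.pop()
        elif model == "ARBITRARY":
            memory[addr] = random.choice(writers)[1]
        elif model == "PRIORITY":
            # the lowest-numbered processor wins
            memory[addr] = min(writers, key=lambda w: w[0])[1]
        else:
            raise ValueError(f"unknown model {model!r}")
    return memory
```

A request set that is legal under COMMON produces the same memory under all three rules, which is one way to see why a COMMON algorithm runs unchanged on the other two models.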

In performing a computation on a PRAM it is typically assumed that the input is written
in the lowest numbered locations of the common memory. PRAM computations are characterized
by p, the number of processors (RAMs) in use, and T (time), the number of PRAM
steps taken. Both measures are usually stated as a function of the size of a problem instance,
namely m, the number of input words, and n, their total length in bits.

After  showing  that  tree,  array,  and  hypercube  algorithms  translate  directly  to  a  PRAM 
algorithm  with  no  loss  in  efficiency,  we  explore  the  power  of  concurrency.  This  is  followed  by  a 
brief  discussion  of  the  simulation  of  a  PRAM  on  a  hypercube  and  a  circuit  on  a  CREW  PRAM. 
We  close  by  referring  the  reader  to  connections  established  between  PRAMs  and  circuits  and 
to  the  discussion  of  serial  space  and  parallel  time  in  Chapter  8. 


7.9.1  Simulating  Trees,  Arrays,  and  Hypercubes  on  the  PRAM 

We have shown that 1D arrays can be embedded into 2D meshes and that d-dimensional
meshes  can  be  embedded  into  hypercubes  while  preserving  the  neighborhood  structure  of  the 
first  graph  in  the  second.  Also,  we  have  demonstrated  that  any  balanced  tree  algorithm  can  be 
simulated  as  a  normal  algorithm  on  a  hypercube.  As  a  consequence,  in  each  case,  an  algorithm 
designed  for  the  first  network  carries  over  to  the  second  without  any  increase  in  the  number  of 
steps  executed.  We  now  show  that  normal  hypercube  algorithms  are  efficiently  simulated  on 
an  EREW  PRAM. 

With  each  d-dimensional  hypercube  processor,  associate  an  EREW  PRAM  processor  and 
a  reserved  location  in  the  common  memory.  In  a  normal  algorithm  each  hypercube  processor 
communicates  with  its  neighbor  along  a  specified  direction.  To  simulate  this  communication, 
each  associated  PRAM  processor  writes  the  data  to  be  communicated  into  its  reserved  location. 
The  processor  for  which  the  message  is  destined  knows  which  hypercube  neighbor  is  providing 
the  data  and  reads  the  value  stored  in  its  associated  memory  location. 
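
One communication step of a normal algorithm, simulated through reserved memory cells, might look as follows. This is a sketch with assumed names (`exchange_along_dimension`, `data`, `common`); the two loops stand in for steps that all PRAM processors perform in synchrony.

```python
def exchange_along_dimension(data, t):
    """Simulate one hypercube exchange across dimension t on an EREW PRAM.

    data[v] is the word held by hypercube processor v (there are 2**d of them).
    Returns the list of words received, indexed by receiving processor.
    """
    n = len(data)
    common = [None] * n
    # Write substep: processor v writes only into its own reserved cell.
    for v in range(n):
        common[v] = data[v]
    # Read substep: processor v reads the cell of its neighbor across dimension t.
    # Distinct processors have distinct neighbors, so no cell is read twice.
    return [common[v ^ (1 << t)] for v in range(n)]
```

Because every cell is written by exactly one processor and read by exactly one processor, the exchange respects the exclusive-read and exclusive-write restrictions.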

When a hypercube algorithm is not normal, as many as $d - 1$ neighbors can send messages
to  one  processor.  Since  EREW  PRAM  processors  can  access  only  one  cell  per  unit  time, 
simulation  of  the  hypercube  can  require  a  running  time  that  is  about  d  times  that  of  the 
hypercube. 

THEOREM 7.9.1 Every T-step normal algorithm on the d-dimensional, n-vertex hypercube,
$n = 2^d$, can be simulated in $O(T)$ steps on an n-processor EREW PRAM. Every T-step hypercube
algorithm, normal or not, can be simulated in $O(Td)$ steps.

An immediate consequence of Theorems 7.7.1 and 7.9.1 is that a list of n items can be
sorted on an n-processor PRAM in $O(\log^2 n)$ steps by a normal oblivious algorithm.
Data-dependent sorting algorithms for the hypercube exist with running time $O(\log n)$.

It also follows from Section 7.6.1 that algorithms for trees, linear arrays, and meshes translate
directly into PRAM algorithms with the same running time as on these less general models.
Of course, the superior connectivity between PRAM processors might be used to produce faster
algorithms.

7.9.2  The  Power  of  Concurrency 

The  CRCW  PRAM  is  a  very  powerful  model.  As  we  show,  any  Boolean  function  can  be 
computed  with  it  in  a  constant  number  of  steps  if  a  sufficient  number  of  processors  is  available. 
For  this  reason,  the  CRCW  PRAM  is  of  limited  interest:  it  represents  an  extreme  that  does 
not  reflect  reality  as  we  know  it.  The  CREW  and  EREW  PRAMs  are  more  realistic.  We 
first  explore  the  power  of  the  CRCW  and  then  show  that  an  EREW  PRAM  can  simulate  a 
p-processor CRCW PRAM with a slowdown by a factor of $O(\log^2 p)$.

THEOREM  7.9.2  The  CRCW  PRAM  can  compute  an  arbitrary  Boolean  function  in  four  steps. 

Proof Given a Boolean function $f : \mathcal{B}^n \mapsto \mathcal{B}$, represent it by its disjunctive normal form;
that is, represent it as the OR of its minterms where a minterm is the AND of each literal of
$f$. (A literal is a variable, $x_i$, or its complement, $\bar{x}_i$.) Assume that each variable is stored in
a separate location in the common memory.

Given  a  minterm,  we  show  that  it  can  be  computed  by  a  CRCW  PRAM  in  two  steps. 
Assign  one  location  in  the  common  memory  to  the  minterm  and  initialize  it  to  the  value  1. 
Assign  one  processor  to  each  literal  in  the  minterm.  The  processor  assigned  to  the  jth  literal 
reads  the  value  of  the  jth  variable  from  the  common  memory.  If  the  value  of  the  literal  is  0, 
this  processor  writes  the  value  0  to  the  memory  location  associated  with  the  minterm.  Thus, 
the minterm has value 1 exactly when each literal has value 1. Note that these processors read
concurrently  with  processors  associated  with  other  minterms  and  may  write  concurrently  if 
more  than  one  of  their  literals  has  value  0. 

Now  assume  that  a  common  memory  location  has  been  reserved  for  the  function  itself 
and  initialized  to  0.  One  processor  is  assigned  to  each  minterm  and  if  the  value  of  its 
minterm  is  1,  it  writes  the  value  1  in  the  location  associated  with  the  function.  Thus,  in  two 
more steps the function $f$ is computed. ■
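
The proof can be mimicked by a short sequential simulation in which loops stand in for processors acting in the same step. This is a sketch, not part of the proof: the name `crcw_eval` is hypothetical, and the minterms are obtained by enumerating the truth table of `f`, which is an assumption about how the function is presented.

```python
from itertools import product


def crcw_eval(f, n, x):
    """Evaluate f: {0,1}^n -> {0,1} at the point x, following the proof's two phases."""
    # One minterm per input assignment on which f has value 1.
    minterms = [m for m in product((0, 1), repeat=n) if f(*m)]

    # Phase 1 (two steps): one cell per minterm, initialized to 1, and one
    # processor per literal. A processor whose literal is 0 writes 0.
    cell = [1] * len(minterms)
    for t, m in enumerate(minterms):
        for j in range(n):                     # processor (t, j)
            literal = x[j] if m[j] == 1 else 1 - x[j]
            if literal == 0:
                cell[t] = 0                    # all concurrent writers store the same value

    # Phase 2 (two steps): one cell for f, initialized to 0, and one processor
    # per minterm. A processor whose minterm is 1 writes 1.
    result = 0
    for t in range(len(minterms)):
        if cell[t] == 1:
            result = 1
    return result
```

In both phases every processor that writes to a given cell writes the same value, so the sketch never relies on a tie-breaking rule among writers.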

Given  the  power  of  concurrency,  especially  as  applied  to  writing,  we  now  explore  the  cost 
in  performance  of  not  allowing  concurrency,  whether  in  reading  or  writing. 

THEOREM 7.9.3 A p-processor priority CRCW PRAM can be simulated by a p-processor EREW
PRAM with a slowdown by a factor equal to the time to sort p elements on this machine.
Consequently, this simulation can be done by a normal algorithm with a slowdown factor of $O(\log^2 p)$.

Proof The jth EREW PRAM processor simulates a memory access by the jth CRCW
PRAM processor by first writing into a special location, $M_j$, a pair $(a_j, j)$ indicating that
processor j wishes to access (read or write) location $a_j$. If processors are writing to common
memory, the value to be written is attached to this pair. If processors are reading from
common memory, a return message containing the requested value is provided. If a processor
chooses not to access any location, a dummy address larger than all other addresses is used for
$a_j$. The contents of the locations $M_1, M_2, \ldots, M_p$ are sorted, which creates a subsequence
in which pairs with a common address occur together and within which the pairs are sorted
by processor numbers. From Theorem 7.7.1 it follows that this step can be performed in
time $O(\log^2 p)$ by a normal algorithm. So far no concurrent reads or writes occur.

A  processor  is  now  assigned  to  each  pair  in  the  sorted  sequence.  We  consider  two  cases: 
a)  processors  are  reading  from  or  b)  writing  to  common  memory.  Each  processor  now 
compares  the  address  of  its  pair  to  that  of  the  preceding  pair.  If  a  processor  finds  these 
addresses  to  be  different  and  case  a  holds,  it  reads  the  item  in  common  memory  and  sets  a 
flag bit to 1; all other processors except the first set their flag bits to 0; the first sets its bit to 1.
(This  bit  is  used  later  to  distribute  the  value  that  was  read.)  However,  if  case  b  holds  instead, 
the  processor  writes  its  value.  Since  this  processor  has  the  lowest  index  of  all  processors  and 
the  priority  CRCW  is  the  strongest  model,  the  value  written  is  the  same  value  written  by 
either  the  common  or  arbitrary  CRCW  models. 

Returning  now  to  case  a,  the  flag  bits  mark  the  first  pair  in  each  subsequence  of  pairs 
that  have  the  same  address  in  the  common  memory.  Associated  with  the  leading  pair  is  the 
value  read  at  this  address.  We  now  perform  a  segmented  prefix  computation  using  as  the 
associative rule the copy-right operation. (See Problem 2.20.) It distributes to each pair
$(a_j, j)$ the value the processor wished to read from the common memory. By Problem 2.21
this problem can be solved by a p-processor EREW PRAM in $O(\log p)$ steps. The pairs
and  their  accompanying  value  are  then  sorted  by  the  processor  number  so  that  the  value 
read  from  the  common  memory  is  in  a  location  reserved  for  the  processor  that  requested  the 
value.  ■ 
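
The read case of this simulation can be sketched sequentially. The function name `simulate_concurrent_read` is hypothetical, Python's `sorted` stands in for the parallel sort, a left-to-right loop stands in for the logarithmic-time segmented prefix computation, and processors that access no location (the dummy-address case) are not modeled.

```python
def simulate_concurrent_read(memory, addresses):
    """Serve concurrent reads using exclusive reads only.

    memory: dict mapping address -> stored value.
    addresses[j]: the address that processor j wishes to read.
    Returns the list of values, indexed by requesting processor.
    """
    p = len(addresses)
    # Sort the pairs (a_j, j): equal addresses become adjacent, ordered by processor.
    pairs = sorted((a, j) for j, a in enumerate(addresses))
    # A flag of 1 marks the first pair of each run of equal addresses.
    flag = [1 if k == 0 or pairs[k][0] != pairs[k - 1][0] else 0 for k in range(p)]
    # Only the leading pair of a run touches memory, so each cell is read once.
    value = [memory[pairs[k][0]] if flag[k] else None for k in range(p)]
    # Segmented copy-right prefix: spread each leader's value over its run.
    for k in range(1, p):
        if not flag[k]:
            value[k] = value[k - 1]
    # Sort back by processor number so each requester finds its value.
    result = [None] * p
    for (_, j), v in zip(pairs, value):
        result[j] = v
    return result
```

For example, with `memory = {10: "a", 20: "b"}` and requests `[20, 10, 20, 20, 10]`, only two memory cells are read, yet all five processors receive their values.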


7.9.3  Simulating  the  PRAM  on  a  Hypercube  Network 

As  stated  above,  each  PRAM  cycle  involves  reading  from  the  global  memory,  performing  a 
local  computation,  and  writing  to  the  common  memory.  Of  course,  a  processor  need  not 
access  common  memory  when  given  the  chance.  Thus,  to  simulate  a  PRAM  on  a  network 
computer,  one  has  to  take  into  account  the  fact  that  not  all  PRAM  processors  necessarily  read 
from  or  write  to  common  memory  locations  on  each  cycle. 

It  is  important  to  remember  that  the  latency  of  network  computers  can  be  large.  Thus,  for 
the  simulation  described  below  to  be  useful,  each  PRAM  processor  must  be  able  to  do  a  lot  of 
work  between  network  accesses. 

The  EREW  PRAM  is  simulated  on  a  network  computer  by  executing  three  phases,  two  of 
which  correspond  to  reading  and  writing  common  memory.  (To  simulate  the  CRCW  PRAM, 
we  need  only  add  the  time  given  above  to  simulate  a  CRCW  PRAM  by  an  EREW  PRAM.) 
We  simulate  an  access  to  common  memory  by  routing  a  message  over  the  network  to  the  site 
containing  the  simulated  common  memory  location.  It  follows  that  a  message  must  contain 
the  name  of  a  site  as  well  as  the  address  of  a  memory  location  at  that  site.  If  the  simulated 
access  is  a  memory  read,  a  return  message  is  generated  containing  the  value  of  the  memory 
location.  If  it  is  a  memory  write,  the  transmitted  message  must  also  contain  the  datum  to  write 
into  the  memory  location.  We  assume  that  the  sites  are  numbered  consecutively  from  1  to  p, 
the  number  of  processors. 

The first problem to be solved is the routing of messages from source to destination processors.
This routing problem was partially addressed in Section 7.8. The new wrinkle here is
that the mapping from source to destination sites defined by a set of messages is not necessarily
a permutation. Not all sources may send a message and not all destinations are guaranteed to
receive only one message. In fact, some destination may be sent many messages, which can
result in their waiting a long time for receipt.

To  develop  an  appreciation  for  the  various  approaches  to  this  problem,  we  describe  an 
algorithm  that  distributes  messages  from  sources  to  destinations,  though  not  as  efficiently  as 
possible.  Each  processor  prepares  a  message  to  be  sent  to  other  processors.  Processors  not 
accessing  the  common  memory  send  messages  containing  dummy  site  addresses  larger  than  any 
other  address.  All  messages  are  sorted  by  destination  address  cooperatively  by  the  processors. 
As seen in Theorem 7.7.1, they can be sorted by a normal algorithm on a p-vertex hypercube,
$p = 2^d$, in $O(\log^2 p)$ steps using Batcher’s bitonic sorting network described in Section 6.8.1.
The $k \le p$ non-dummy messages are the first k messages in this sorted list. If the sites at
which  these  messages  reside  after  sorting  are  the  sites  for  which  they  were  destined,  the  message 
routing  problem  is  solved.  Unfortunately,  this  is  generally  not  the  case. 

To route the messages from their positions in the sorted list to their destinations, we first
identify duplicates of destination addresses and compute D, the maximum number of duplicates.
We then route messages in D stages. In each stage at most one of the D duplicates
of each message is routed to its destination. To identify duplicates, we assign a processor to
each message in the sorted list that compares its destination site with that of its predecessor,
setting a flag bit to 0 if equal and to 1 otherwise. To compare destinations, move messages
to adjacent vertices on the hypercube, compare, and then reverse the process. (Move them by
sorting by appropriate addresses.) The first processor also sets its flag bit to 1. A segmented
integer addition prefix operation that segments its messages with these flag bits assigns to each
message an integer (a priority) between 1 and D that is q if the site address of this message is
the qth such address. (Prefix computations can be done on a p-vertex hypercube in $O(\log p)$
steps. See Problem 7.23.) A message with priority q is routed to its destination in the qth stage.
An unsegmented prefix operation with max as the operator is then used to determine D.
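
The flag bits, priorities, and the value of D can be computed as in the following sketch. The function name `assign_priorities` is hypothetical, and the sequential loops stand in for the parallel prefix computations described above.

```python
def assign_priorities(sorted_dests):
    """Return (flags, priorities, D) for a list of destinations sorted in ascending order."""
    # Flag is 1 where the destination differs from its predecessor; the first flag is 1.
    flags = [1 if k == 0 or d != sorted_dests[k - 1] else 0
             for k, d in enumerate(sorted_dests)]
    # Segmented addition prefix over a vector of 1s: the running sum restarts at each flag.
    priorities = []
    for f in flags:
        priorities.append(1 if f else priorities[-1] + 1)
    # Unsegmented prefix with max as the operator; the final prefix value is D.
    D = 0
    for q in priorities:
        D = max(D, q)
    return flags, priorities, D
```

For the destinations `[3, 3, 5, 7, 7, 7]` this yields flags `[1, 0, 1, 1, 0, 0]`, priorities `[1, 2, 1, 1, 2, 3]`, and D = 3, so three routing stages would be needed.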

In the qth stage, $1 \le q \le D$, all non-dummy messages with priority q are routed to their
destination site on the hypercube as follows:

a) one processor is assigned to each message;

b) each such processor computes the gap, the difference between the destination and current
site of its message;

c) each gap g is represented as a binary d-tuple $g = (g_{d-1}, \ldots, g_0)$;

d) for $t = d-1, d-2, \ldots, 0$, those messages whose gap contains $2^t$ are sent to the site
reached by crossing the tth dimension of the hypercube.

We show that in at most $O(D \log p)$ steps all messages are routed to their destinations.
Let the sorted message sites form an ascending sequence. If there are k non-dummy messages,
let $gap_i$, $0 \le i \le k - 1$, be the gap of the ith message. Observe that these gaps must also
form a nondecreasing sequence. For example, shown below is a sorted set of destinations and
corresponding sequence of gaps:

gap_i    1   1   2   2   3   6   7   8
dest_i   1   2   4   5   7  11  13  15
i        0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15

All the messages whose gaps contain $2^{d-1}$ must be the last messages in the sequence because
the gaps would otherwise be out of order. Thus, advancing messages with these gaps by
$2^{d-1}$ positions, which is done by moving them across the largest dimension of the hypercube,
advances them to positions in the sequence that cannot be occupied by any other messages,
even after these messages have been advanced by their full gaps. For example, shown below are
the positions of the messages given above after those whose gaps contain 8 and 4 have been
moved by this many positions (a dash marks a site that holds no message):

dest_i   1   2   4   5   7   -   -   -   -  11  13   -   -   -   -  15
i        0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15


Repeating  this  argument  on  subsequent  smaller  powers  of  2,  we  find  that  no  two  messages 
that  are  routed  in  a  given  stage  occupy  the  same  site.  As  a  consequence,  after  D  stages,  each 
taking  d  steps,  all  messages  are  routed.  We  summarize  this  result  below. 
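
The collision-free claim can be checked on the example with a short sketch. The function name `route_by_gaps` is hypothetical. The code tracks only the positions of messages in the sequence of sites, advancing a message by $2^t$ positions when its gap contains $2^t$; it does not model which hypercube edges carry each move.

```python
def route_by_gaps(dests, d):
    """Advance sorted messages toward their destinations one power of two at a time.

    dests[i] is the destination of the message initially at site i; the list is
    strictly increasing and all sites lie in range(2**d).
    Returns the list of position vectors after each of the d steps.
    """
    pos = list(range(len(dests)))
    gaps = [dest - i for i, dest in enumerate(dests)]
    history = []
    for t in range(d - 1, -1, -1):
        # Messages whose gap contains 2**t advance by 2**t positions, all at once.
        pos = [p + (1 << t) if (g >> t) & 1 else p for p, g in zip(pos, gaps)]
        if len(set(pos)) != len(pos):
            raise AssertionError("two messages occupy the same site")
        history.append(list(pos))
    return history
```

With `dests = [1, 2, 4, 5, 7, 11, 13, 15]` and `d = 4`, the positions after the steps for 8 and 4 are `[0, 1, 2, 3, 4, 9, 10, 15]`, and the final positions equal the destinations.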


THEOREM  7.9.4  Each  computation  cycle  of  a  p-processor  EREW  PRAM  can  be  simulated  by  a 
normal algorithm on a p-vertex hypercube in $O(D \log p + \log^2 p)$ steps, where D is the maximum
number  of  processors  accessing  memory  locations  stored  at  a  given  vertex  of  the  hypercube. 


This result can be improved to $O(\log p)$ [158] with a probabilistic algorithm that replicates
each  datum  at  each  hypercube  processor  a  fixed  number  of  times. 

Because the simulation described above of an EREW PRAM on a hypercube consists of a
fixed number of normal steps and fully normal sequences of steps, $O(D\sqrt{p})$- and $O(Dp)$-time
simulations of a PRAM on two-dimensional meshes and linear arrays follow. (See Problems
7.32 and 7.33.)


7.9.4  Circuits  and  the  CREW  PRAM 

Algebraic  and  logic  circuits  can  also  be  simulated  on  PRAMs,  in  particular  the  CREW  PRAM. 
For  simplicity  we  assign  one  processor  to  each  vertex  of  a  circuit  (a  gate).  We  also  assume  that 
each  vertex  has  bounded  fan-in,  which  for  concreteness  is  assumed  to  be  2.  We  also  reserve  one 
memory  location  for  each  gate  and  one  for  each  input  variable.  Each  processor  now  alternates 
between  reading  values  from  its  two  inputs  (concurrently  with  other  processors,  if  necessary) 
and  exclusively  writing  values  to  the  location  reserved  for  its  value.  Two  steps  are  devoted  to 
reading the values of gate inputs. Let $D_\Omega(f)$ be the depth of the circuit for a function $f$. After
$2D_\Omega(f)$ steps the input values have propagated to the output gates, the values computed by
them  are  correct  and  the  computation  is  complete. 
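
The simulation can be sketched as follows. The names `OPS` and `simulate_circuit` and the dictionary encoding of a circuit are assumptions made for illustration. Each loop iteration stands for the two PRAM steps in which every gate processor reads its two inputs and then writes its own cell.

```python
OPS = {
    "AND": lambda a, b: a & b,
    "OR": lambda a, b: a | b,
    "XOR": lambda a, b: a ^ b,
}


def simulate_circuit(gates, inputs, depth):
    """Simulate a fan-in-2 circuit with one processor and one memory cell per gate.

    gates: dict mapping gate name -> (operation, left operand name, right operand name).
    inputs: dict mapping input variable name -> 0 or 1.
    depth: depth of the circuit; the loop models 2 * depth PRAM steps.
    """
    memory = dict(inputs)
    for g in gates:
        memory[g] = 0                  # reserved cell per gate; the initial value is arbitrary
    for _step in range(depth):
        # Reads: every processor fetches both operands from the current memory.
        # Several gates may read the same cell, hence concurrent read.
        operands = {g: (memory[a], memory[b]) for g, (op, a, b) in gates.items()}
        # Write: each processor writes only its own cell, hence exclusive write.
        for g, (op, a, b) in gates.items():
            memory[g] = OPS[op](*operands[g])
    return memory
```

After the first iteration the gates at depth 1 hold correct values, after the second those at depth 2 do, and so on, which is why `depth` iterations suffice.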

In  Section  8.14  we  show  a  stronger  result,  that  CREW  PRAMs  and  circuits  are  equivalent 
as  language  recognizers.  We  also  explore  the  parallel  computation  thesis,  which  states  that 
sequential  space  and  parallel  time  are  polynomially  related.  It  follows  that  the  PRAM  and  the 
logic  circuit  are  both  excellent  models  in  terms  of  which  to  measure  the  minimal  computation 
time  required  for  a  problem  on  a  parallel  machine.  In  Section  8.15  we  exhibit  complexity 
classes,  that  is,  classes  of  languages  defined  in  terms  of  the  depth  of  circuits  recognizing  them. 


7.10 The BSP and LogP Models

Bulk synchronous parallelism (BSP) extends the MIMD model to potentially different asynchronous
programs running on the physical processors of a parallel computer. Its developers
believe that the BSP model is both built on realistic assumptions and sufficiently simple to
provide an attractive model for programming parallel computers. They expect it will play a
role similar to that of the RAM for serial computation, that is, that programs written for the
BSP model can be translated into efficient code for a variety of parallel machines.

The  BSP  model  explicitly  assumes  that  a)  computations  are  divided  into  supersteps,  b)  all 
processors  are  synchronized  after  each  superstep,  c)  processors  can  send  and  receive  messages 
to  and  from  all  other  processors,  d)  message  transmission  is  non-blocking  (computation  can 
resume  after  sending  a  message),  and  e)  all  messages  are  delivered  by  the  end  of  a  superstep. 
The  important  parameters  of  this  model  are  p,  the  number  of  processors,  s,  the  speed  of  each 
processor,  l,  the  latency  of  the  system,  which  is  the  number  of  processor  steps  to  synchronize 
processors,  and  g,  the  additional  number  of  processor  steps  per  word  to  deliver  a  message. 
Here  g  measures  the  time  per  word  to  transmit  a  message  between  processors  after  the  path 
between  them  has  been  set  up;  l  measures  the  time  to  set  up  paths  between  processors  and/or 
to  synchronize  all  p  processors.  Each  of  these  parameters  must  be  appraised  under  “normal” 
computational  and  communication  loads  if  the  model  is  to  provide  useful  estimates  of  the  time 
to  complete  a  task. 

For  the  BSP  model  to  be  effective,  it  must  be  possible  to  keep  the  processors  busy  while 
waiting  for  communications  to  be  completed.  If  the  latency  of  the  network  is  too  high,  this 
will  not  be  possible.  It  will  also  not  be  possible  if  algorithms  are  not  designed  properly.  For 
example,  if  all  processors  attempt  to  send  messages  to  a  single  processor,  network  congestion 
will  prevent  the  messages  from  being  answered  quickly.  It  has  been  shown  that  for  many 
important  problems  data  can  be  distributed  and  algorithms  designed  to  make  good  use  of  the 
BSP model [348]. It should also be noted that the BSP model is not effective on problems that
are  not  parallelizable,  such  as  may  be  the  case  for  P-complete  problems  (see  Section  8.9). 

Although  for  many  problems  and  machines  the  BSP  model  is  a  good  one,  it  does  not 
take  into  account  network  congestion  due  to  the  number  of  messages  in  transit.  The  LogP 
model  extends  the  BSP  model  by  explicitly  accounting  for  the  overhead  time  (the  o  in  LogP) 
to  prepare  a  message  for  transmission.  The  model  is  also  characterized  by  the  parameters  L,  g, 
and P that have the same meaning as the parameters l, g, and p in the BSP model. The LogP
and  BSP  models  are  about  equally  good  at  predicting  algorithm  performance. 

Many other models have been proposed to capture one aspect or another of practical parallel
computation. Chapter 11 discusses some of the parallel I/O issues.


Problems 

PARALLEL  COMPUTERS  WITH  MEMORY 

7.1 Consider the design of a bus arbitration sequential circuit for a computer containing
four CPUs. This circuit has four Boolean inputs and outputs, one per CPU. A CPU
requesting bus access sets its input to 1 and waits until its output is set to 1, after which
it puts its word and destination address on the bus. CPUs not requesting bus access set
their bus arbitration input variable to 0.

At the beginning of each cycle the bus arbitration circuit reads the input variables and,
if at least one of them has value 1, sets one output variable to 1. If all input variables
are 0, it sets all output variables to 0.

Design two such arbitration circuits, one that grants priority to the lowest indexed
input that is 1 and a second that grants priority alternately to the lowest and highest
indexed input if more than one input variable is 1.

Figure 7.22 A four-by-four mesh-of-trees network.


7.2  Sketch  a  data-parallel  program  that  operates  on  a  sorted  list  of  keys  and  finds  the  largest 
number  of  times  that  a  key  is  repeated. 

7.3  Sketch  a  data-parallel  program  to  find  the  last  record  in  a  linked  list  where  initially  each 
record  contains  the  address  of  the  next  item  in  the  list  (except  for  the  last  item,  whose 
next  address  is  null). 

Hint:  Assign  one  processor  to  each  list  item  and  assume  that  accesses  to  two  or  more 
distinct  addresses  can  be  done  simultaneously. 

7.4 The n × n mesh-of-trees network, $n = 2^r$, is formed from an n × n mesh by replacing
each linear connection forming a row or column by a balanced binary tree. (See
Fig. 7.22.) Let the entries of two n × n matrices be uniformly distributed on the vertices
of the original mesh. Give an efficient matrix multiplication algorithm on this network and
determine its running time.

7.5  Identify  problems  that  arise  in  a  crossbar  network  when  more  than  one  source  wishes 
to connect to the same destination. Describe how to ensure that only one source is
connected  to  one  destination  at  the  same  time. 

THE  PERFORMANCE  OF  PARALLEL  ALGORITHMS 

7.6  Describe  how  you  might  apply  Amdahl’s  Law  to  a  data-parallel  program  to  estimate  its 
running  time. 

7.7 Consider the evaluation of the polynomial $p(x) = a_n x^n + a_{n-1} x^{n-1} + \cdots + a_1 x + a_0$
on a $p$-processor shared-memory machine. Sketch an algorithm whose running time is
$O(n/p + \log n)$ for this problem.
LINEAR ARRAYS

7.8 Generalize the example of Section 7.5.1 to show that the product of an $n \times n$ matrix
and an $n$-vector can be realized in $3n - 1$ steps on a linear systolic array.

7.9 Show that every algorithm on a linear array to compute the product of an $n \times n$ matrix
and an $n$-vector requires at least $n$ steps. Assume that components of the matrix and
vector enter cells individually.

7.10 Design an algorithm for a linear array of length $O(n)$ that convolves two sequences
each of length $n$ in $O(n)$ steps. Show that no substantially faster algorithm for such a
linear array exists.

MULTIDIMENSIONAL  ARRAYS 

7.11 Show that at most $\sigma(d) = 2d^2 + 2d + 1$ cells are at most $d$ edges away from any cell
in a two-dimensional systolic array.

7.12 Derive an expression for the distance between vertices $(n_1, n_2, \ldots, n_d)$ and
$(m_1, m_2, \ldots, m_d)$ in a $d$-dimensional toroidal mesh and determine the maximum distance
between two such vertices.

7.13 Design efficient algorithms to multiply two $n \times n$ matrices on a $k \times k$ mesh, $k < n$.

HYPERCUBE-BASED  MACHINES 

7.14 Show that the vertices of the $2^d$-input FFT graph can be numbered so that edges
between levels correspond to swaps across the dimensions of a $d$-dimensional hypercube.

7.15 Show that the convolution function $f_{\mathrm{conv}}^{(n,m)} : \mathcal{R}^{n+m} \mapsto \mathcal{R}^{n+m-1}$ over a commutative
ring $\mathcal{R}$ can be implemented by a fully normal algorithm in time $O(\log n)$.

7.16 Prove that the unshuffle operation on a linear array of $n = 2^d$ cells can be done with
$2^d - 1$ comparison/exchange steps.

7.17 Prove that the algorithm described in Section 7.7.4 to simulate a normal hypercube
algorithm on a linear array of $n = 2^d$ elements correctly places into exchange locations
elements whose indices differ by successive powers of 2.

7.18 Describe an efficient algorithm for a linear array that merges two sorted sequences of
the same length.

7.19 Show that Batcher’s sorting algorithm based on bitonic merging can be realized on a
$p$-vertex hypercube by a normal algorithm in $O(\log^2 p)$ steps.

7.20 Show that Batcher’s sorting algorithm based on bitonic merging can be realized on a
linear array of $n = 2^d$ cells in $O(n)$ steps.

7.21 Show that Batcher’s sorting algorithm based on bitonic merging can be realized on a
$\sqrt{n} \times \sqrt{n}$ array in $O(\sqrt{n})$ steps.

7.22 Design an $O(\sqrt{n})$-step algorithm to implement an arbitrary permutation of $n$ items
placed one per cell of a $\sqrt{n} \times \sqrt{n}$ mesh.

7.23 Describe a normal algorithm to realize a prefix computation on a $p$-vertex hypercube in
$O(\log p)$ steps.

7.24 Design an algorithm to perform a prefix computation on a $\sqrt{n} \times \sqrt{n}$ mesh in $3\sqrt{n}$
steps. Show that no other algorithm for this problem on this mesh has substantially
better performance.

ROUTING  IN  NETWORKS 

7.25  Give  a  complete  description  of  a  procedure  to  set  up  the  switches  in  a  Benes  network. 

7.26  Show  how  to  perform  an  arbitrary  permutation  on  a  linear  array. 

THE  PRAM  MODEL 

7.27 a) Design an $O(1)$-step CRCW PRAM algorithm to find the maximum element in a
list.

b) Design an $O(\log \log n)$-step CRCW PRAM algorithm to find the maximum element
in a list that uses $O(n)$ processors.

Hint:  Construct  a  tree  in  which  the  root  and  every  other  vertex  has  a  number  of 
immediate  descendants  that  is  about  equal  to  the  square  root  of  the  number  of  leaves 
that  are  its  descendants. 

7.28  The  goal  of  the  list-ranking  problem  is  to  assign  a  rank  to  each  record  in  a  linked 
list;  the  rank  of  a  record  is  its  position  relative  to  the  last  element  in  the  list  where  the 
last  element  has  rank  zero.  Each  record  has  two  fields,  one  for  its  rank  and  another  for 
the  address  of  its  successor  record.  The  address  field  of  the  last  record  contains  its  own 
address. 

Describe  an  efficient  p-processor  EREW  PRAM  algorithm  to  solve  the  list-ranking 
problem  for  a  list  of  p  items  stored  one  per  location  in  the  common  memory. 

Hint:  Use  pointer  doubling  in  which  each  address  is  replaced  by  the  address  of  its 
current  successor. 
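
The pointer-doubling idea in the hint can be illustrated by a sequential Python sketch that simulates the synchronous rounds of a parallel machine. The names `list_rank` and `succ` are illustrative assumptions, and the sketch ignores the exclusive-read restriction of the EREW PRAM (several records may read the last record in the same round), so it is a starting point rather than a solution to the problem.

```python
def list_rank(succ):
    """Rank the records of a linked list by pointer doubling.

    succ[i] is the index of the successor of record i; the last record
    points to itself. Returns (rank, rounds), where rank[i] is the
    distance from record i to the last record and rounds is the number
    of synchronous doubling rounds that were simulated.
    """
    n = len(succ)
    succ = list(succ)
    # Invariant: rank[i] is the distance from record i to succ[i].
    rank = [0 if succ[i] == i else 1 for i in range(n)]
    rounds = 0
    # Stop once every pointer has reached the last record.
    while any(succ[succ[i]] != succ[i] for i in range(n)):
        # All "processors" read old values, then all write new ones.
        new_rank = [rank[i] + rank[succ[i]] for i in range(n)]
        new_succ = [succ[succ[i]] for i in range(n)]
        rank, succ = new_rank, new_succ
        rounds += 1
    return rank, rounds


# The list 3 -> 0 -> 2 -> 1, where record 1 is last.
print(list_rank([2, 1, 1, 0]))  # ([2, 0, 1, 3], 2)
```

Each round doubles the distance spanned by every pointer, so the number of rounds grows logarithmically with the length of the list.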

7.29  Consider  an  n-vertex  directed  graph  in  which  each  vertex  knows  the  address  of  its 
parent  and  the  roots  have  themselves  as  parents.  Under  the  assumption  that  each  vertex 
is  placed  in  a  unique  cell  in  a  common  PRAM  memory,  show  that  the  roots  can  be 
found in $O(\log n)$ steps.

7.30  Design  an  efficient  PRAM  algorithm  to  find  the  item  in  a  list  that  occurs  most  often. 

7.31 Figure 7.23 shows two trees containing one and three copies of a computational
element, respectively. This element accepts three inputs and produces three outputs using
$\odot$, an associative operator. Tree (a) accepts $a$, $b$, and $c$ as input and produces $a$, $a \odot b$,
and $b \odot c$ as output. Tree (b) accepts $a$, $b$, $c$, $d$, and $e$ as input and produces $a$, $a \odot b$,
$a \odot b \odot c$, $a \odot b \odot c \odot d$, and $b \odot c \odot d \odot e$ as output. If the input and output at the root
of the trees are combined with $\odot$, the output of each tree is the prefix computation on
its inputs.

Generalize  the  constructions  of  Fig.  7.23  to  produce  a  circuit  for  the  prefix  function  on 
n  inputs,  n  arbitrary.  Give  a  convincing  argument  that  your  construction  is  correct  and 
derive  good  upper  bounds  on  the  size  and  depth  of  your  circuit.  Show  that  to  within 
multiplicative  factors  your  construction  has  minimal  size  and  depth. 
Figure 7.23 Components of an efficient prefix circuit.


7.32 Show that each computation cycle of a $p$-processor EREW PRAM can be simulated on
a $\sqrt{p} \times \sqrt{p}$ mesh in $O(D\sqrt{p})$ steps, where $D$ is the maximum number of processors
accessing memory locations stored at a given vertex of the mesh.

7.33 Show that each computation cycle of a $p$-processor EREW PRAM can be simulated
on a $p$-processor linear array in $O(Dp)$ steps, where $D$ is the maximum number of
processors accessing memory locations stored at a given vertex of the array.

THE  BSP  AND  LOGP  MODELS 

7.34 Design an algorithm for the $p$-processor BSP and/or LogP models to multiply two $n \times n$
matrices  when  each  matrix  entry  occurs  once  and  entries  are  uniformly  distributed  over 
the  p  processors.  Given  the  parameters  of  the  models,  determine  for  which  values  of  n 
your  algorithm  is  efficient. 

Hint:  The  performance  of  your  algorithm  will  be  dependent  on  the  initial  placement 
of  data. 

7.35  Design  an  algorithm  for  the  p-processor  BSP  and/or  LogP  models  for  the  segmented 
prefix  function.  Given  the  parameters  of  the  models,  determine  for  which  values  of  n 
your  algorithm  is  efficient. 

Chapter  Notes 

A discussion of parallel algorithms and architectures up to about 1980 can be found in the book
by Hockney and Jesshope [135]. A number of recent textbooks provide extensive coverage of
parallel algorithms and architectures. They include the books by Akl [16], Bertsekas and
Tsitsiklis  [38],  Gibbons  and  Spirakis  [113],  JaJa  [148],  Leighton  [192],  Quinn  [265],  and  Reif 
[277].  In  addition,  the  survey  article  by  Karp  and  Ramachandran  [161]  gives  an  overview 
of  parallel  algorithmic  methods.  References  to  results  on  circuit  complexity  can  be  found  in 
Chapters  2,  6,  and  9. 

Flynn introduced the taxonomy of parallel computers that carries his name [102]. The
data-parallel style of computing was anticipated in the APL [146] and FP programming
languages [26] as well as by Preparata and Vuillemin [262] in their study of parallel algorithms
for networked machines. It was developed as the style of choice for programming the
Connection Machine [133]. (See also the books by Hatcher and Quinn [129] and Blelloch [45] on
data-parallel computing.) The simulation of the MIMD computer by a SIMD one given in
Section 7.3.1 is due to Wloka [365].

Amdahl’s  Law  [21]  and  Brent’s  principle  [58]  are  widely  cited;  the  latter  is  used  extensively 
to  design  efficient  parallel  algorithms. 

Systolic  algorithms  for  convolution,  matrix  multiplication,  and  the  fast  Fourier  transform 
are given by Kung and Leiserson [180] (see also [181]). Odd-even transposition sort is
described by Knuth [170]. The lower bound on the time to multiply two matrices given in
Theorem 7.5.3 is due to Gentleman [112]. The shuffle network was introduced by Stone
[318].

Preparata  and  Vuillemin  [262]  give  normal  algorithms  for  a  variety  of  problems  (including 
that  for  shifting  in  Section  7.7.3)  and  introduce  the  cube-connected  cycles  machine.  They  also 
give  embeddings  of  fully  normal  algorithms  into  linear  arrays  and  meshes.  Dekel,  Nassimi,  and 
Sahni  [85]  developed  the  fast  algorithm  for  matrix  multiplication  on  the  hypercube  described 
in  Section  7.7.7. 

Batcher  [29]  introduced  odd-even  and  bitonic  sorting  methods  and  noted  that  they  could 
be  used  for  routing  messages  in  networks.  Benes  [36]  is  the  author  of  the  Benes  permutation 
network. 

Variants of the PRAM were introduced by Fortune and Wyllie [103], Goldschlager [118],
and Savitch and Stimson [298] as generalizations of the idealized RAM model of Cook and
Reckhow [77]. The method given in Theorem 7.9.3 to simulate a CRCW PRAM on an EREW
PRAM is due to Eckstein [95] and Vishkin [353]. Simulations of PRAMs on networked
computers have been developed by Mehlhorn and Vishkin [221], Upfal [340], Upfal and
Wigderson [341], Karlin and Upfal [158], Alt, Hagerup, Mehlhorn, and Preparata [19], and Ranade
[267]. Cypher and Plaxton [84] have developed a deterministic $O(\log p \log \log p)$-step
sorting algorithm for the hypercube. However, it is superior to Batcher’s algorithm only for very
large and impractical values of $p$.

The  bulk  synchronous  parallel  (BSP)  model  [348]  has  been  proposed  as  a  bridging  model 
between  the  needs  of  programmers  and  parallel  machines.  The  LogP  model  [83]  is  offered  as 
a  more  realistic  variant  of  the  BSP  model.  Juurlink  and  Wijshoff  [154]  and  Bilardi,  Herley, 
Pietracaprina,  Pucci,  and  Spirakis  [39]  report  empirical  evidence  that  the  BSP  and  LogP  models 
are  about  equally  good  as  predictors  of  performance  on  real  parallel  computers. 
In an ideal world, each computational problem would be classified at least approximately by its
use of computational resources. Unfortunately, our ability to so classify some important
problems is limited. We must be content to show that such problems fall into general complexity
classes, such as the polynomial-time problems P, problems whose running time on a
deterministic Turing machine is a polynomial in the length of its input, or NP, the polynomial-time
problems on nondeterministic Turing machines.

Many complexity classes contain “complete problems,” problems that are hardest in the
class. If the complexity of one complete problem is known, that of all complete problems is
known. Thus, it is very useful to know that a problem is complete for a particular complexity
class. For example, the class of NP-complete problems, the hardest problems in NP, contains
many hundreds of important combinatorial problems such as the Traveling Salesperson
Problem. It is known that each NP-complete problem can be solved in time exponential in the size
of the problem, but it is not known whether they can be solved in polynomial time. Whether
P and NP are equal or not is known as the $\mathrm{P} \stackrel{?}{=} \mathrm{NP}$ question. Decades of research have been
devoted to this question without success. As a consequence, knowing that a problem is
NP-complete is good evidence that it is an exponential-time problem. On the other hand, if one
such problem were shown to be in P, all such problems would have been shown to be in P, a
result that would be most important.

In this chapter we classify problems by the resources they use on serial and parallel
machines. The serial models are the Turing and random-access machines. The parallel models
are  the  circuit  and  the  parallel  random-access  machine  (PRAM).  We  begin  with  a  discussion 
of  tasks,  machine  models,  and  resource  measures,  after  which  we  examine  serial  complexity 
classes  and  relationships  among  them.  Complete  problems  are  defined  and  the  P-complete, 
NP-complete,  and  PSPACE-complete  problems  are  examined.  We  then  turn  to  the  PRAM 
and  circuit  models  and  conclude  by  identifying  important  circuit  complexity  classes  such  as 
NC  and  P/poly. 
8.1 Introduction

The classification of problems requires a precise definition of those problems and the
computational models used. Problems are accurately classified only when we are sure that they
have  been  well  defined  and  that  the  computational  models  against  which  they  are  classified  are 
representative  of  the  computational  environment  in  which  these  problems  will  be  solved.  This 
requires  the  computational  models  to  be  general.  On  the  other  hand,  to  be  useful,  problem 
classifications  should  not  be  overly  dependent  on  the  characteristics  of  the  machine  model  used 
for  classification  purposes.  For  example,  because  of  the  obviously  inefficient  use  of  memory  on 
the  Turing  machine,  the  set  of  problems  that  runs  in  time  linear  in  the  length  of  their  input  on 
a  random-access  machine  is  likely  to  be  different  from  the  set  that  runs  in  linear  time  on  the 
Turing  machine.  On  the  other  hand,  the  set  of  problems  that  run  in  polynomial  time  on  both 
machines  is  the  same. 


8.2  Languages  and  Problems 

Before  formally  defining  decision  problems,  a  major  topic  of  this  chapter,  we  give  two  examples 
of  them,  SATISFIABILITY  and  UNSATISFIABILITY.  A  set  of  clauses  is  satisfiable  if  values  can 
be  assigned  to  Boolean  variables  in  these  clauses  such  that  each  clause  has  at  least  one  literal 
with value 1.

SATISFIABILITY

Instance: A set of literals $X = \{x_1, \bar{x}_1, x_2, \bar{x}_2, \ldots, x_n, \bar{x}_n\}$ and a sequence of clauses
$C = (c_1, c_2, \ldots, c_m)$, where each clause $c_i$ is a subset of $X$.

Answer: “Yes” if for some assignment of Boolean values to variables in $\{x_1, x_2, \ldots, x_n\}$, at
least one literal in each clause has value 1.

The complement of the decision problem SATISFIABILITY, UNSATISFIABILITY, is defined
below.

UNSATISFIABILITY

Instance: A set of literals $X = \{x_1, \bar{x}_1, x_2, \bar{x}_2, \ldots, x_n, \bar{x}_n\}$ and a sequence of clauses
$C = (c_1, c_2, \ldots, c_m)$, where each clause $c_i$ is a subset of $X$.

Answer: “Yes” if for all assignments of Boolean values to variables in $\{x_1, x_2, \ldots, x_n\}$, all
literals in at least one clause have value 0.

The clauses $C_1 = (\{x_1, x_2, x_3\}, \{\bar{x}_1, x_2\}, \{\bar{x}_2, x_3\})$ are satisfied with $x_1 = x_2 = x_3 = 1$,
whereas the clauses $C_2 = (\{x_1, x_2, x_3\}, \{\bar{x}_1, x_2\}, \{\bar{x}_2, x_3\}, \{\bar{x}_3, x_1\}, \{\bar{x}_1, \bar{x}_2, \bar{x}_3\})$ are not
satisfiable. SATISFIABILITY consists of collections of satisfiable clauses. $C_1$ is in
SATISFIABILITY. The complement of SATISFIABILITY, UNSATISFIABILITY, consists of instances of clauses
not all of which can be satisfied. $C_2$ is in UNSATISFIABILITY.

We now introduce terminology used to classify problems. This terminology and the
associated concepts are used throughout this chapter.

DEFINITION 8.2.1 Let $\Sigma$ be an arbitrary finite alphabet. A decision problem $\mathcal{P}$ is defined by a
set of instances $I \subseteq \Sigma^*$ of the problem and a condition $\phi_{\mathcal{P}} : I \mapsto \mathcal{B}$ that has value 1 on “Yes”
instances and 0 on “No” instances. Then $I_{\mathrm{yes}} = \{w \in I \mid \phi_{\mathcal{P}}(w) = 1\}$ are the “Yes” instances.
The “No” instances are $I_{\mathrm{no}} = I - I_{\mathrm{yes}}$.

The complement of a decision problem $\mathcal{P}$, denoted co$\mathcal{P}$, is the decision problem in which
the “Yes” instances of co$\mathcal{P}$ are the “No” instances of $\mathcal{P}$ and vice versa.

The “Yes” instances of a decision problem are encoded as binary strings by an encoding
function $\sigma : \Sigma^* \mapsto \mathcal{B}^*$ that assigns to each $w \in I$ a string $\sigma(w) \in \mathcal{B}^*$.

With respect to $\sigma$, the language $L(\mathcal{P})$ associated with a decision problem $\mathcal{P}$ is the set
$L(\mathcal{P}) = \{\sigma(w) \mid w \in I_{\mathrm{yes}}\}$. With respect to $\sigma$, the language $L(\mathrm{co}\mathcal{P})$ associated with co$\mathcal{P}$ is the
set $L(\mathrm{co}\mathcal{P}) = \{\sigma(w) \mid w \in I_{\mathrm{no}}\}$.

The complement of a language $L$, denoted $\overline{L}$, is $\mathcal{B}^* - L$; that is, $\overline{L}$ consists of the strings
that are not in $L$.

A decision problem can be generalized to a problem $\mathcal{P}$ characterized by a function $f : \mathcal{B}^* \mapsto
\mathcal{B}^*$ described by a set of ordered pairs $(x, f(x))$, where each string $x \in \mathcal{B}^*$ appears once as the
left-hand side of a pair. Thus, a language is defined by problems $f : \mathcal{B}^* \mapsto \mathcal{B}$ and consists of the
strings on which $f$ has value 1.

SATISFIABILITY  and  all  other  decision  problems  in  NP  have  succinct  “certificates”  for 
“Yes”  instances,  that  is,  choices  on  a  nondeterministic  Turing  machine  that  lead  to  acceptance 
of  a  “Yes”  instance  in  a  number  of  steps  that  is  a  polynomial  in  the  length  of  the  instance.  A 
certificate  for  an  instance  of  SATISFIABILITY  consists  of  values  for  the  variables  of  the  instance 
on which each clause has at least one literal with value 1. The verification of such a certificate
can  be  done  on  a  Turing  machine  in  a  number  of  steps  that  is  quadratic  in  the  length  of  the 
input.  (See  Problem  8.3.) 
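
Certificate verification can be sketched in Python as follows. The encoding of literals as signed integers ($+i$ for $x_i$, $-i$ for its complement), the name `satisfies`, and the sample instance are assumptions made for illustration. The sketch runs on a random-access model, so it says nothing about the quadratic bound quoted for the Turing machine.

```python
def satisfies(clauses, assignment):
    """Return True if every clause has at least one literal with value 1.

    clauses: iterable of clauses, each an iterable of nonzero integers
             (+i stands for x_i and -i for the complement of x_i).
    assignment: dict mapping each variable index i to 0 or 1.
    """
    def value(literal):
        v = assignment[abs(literal)]
        return v if literal > 0 else 1 - v

    return all(any(value(lit) == 1 for lit in clause) for clause in clauses)


# Illustrative instance (an assumption, not taken from the text).
example = [[1, 2, 3], [-1, 2], [-2, 3]]
print(satisfies(example, {1: 1, 2: 1, 3: 1}))  # True
print(satisfies(example, {1: 1, 2: 0, 3: 1}))  # False: clause [-1, 2] fails
```

The check visits each literal once, so its cost is proportional to the length of the instance; the difficulty lies in finding a certificate, not in checking one.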

Similarly, UNSATISFIABILITY and all other decision problems in coNP can be disqualified
quickly; that is, their “No” instances can be “disqualified” quickly by exhibiting certificates for
them (which are certificates for the “Yes” instances of the complementary decision problem).
For example, a disqualification for UNSATISFIABILITY is a satisfying assignment for a “No”
instance, that is, for a satisfiable set of clauses.

It is not known how to identify a certificate for a “Yes” instance of SATISFIABILITY or any
other NP-complete problem in time polynomial in the length of the instance. If a “Yes” instance
has $n$ variables, an exhaustive search of the $2^n$ values for the $n$ variables is about the best general
method known to find an answer.
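
A minimal sketch of such an exhaustive search is given below, again assuming that literals are encoded as signed integers ($+i$ for $x_i$, $-i$ for its complement). The function name `find_certificate` and both sample instances are illustrative assumptions.

```python
from itertools import product


def find_certificate(clauses, n):
    """Try all 2**n assignments to x_1, ..., x_n in lexicographic order.

    Returns the first satisfying assignment as a dict {i: 0 or 1},
    or None if the clauses are not satisfiable.
    """
    for bits in product((0, 1), repeat=n):
        assignment = dict(enumerate(bits, start=1))
        if all(
            any((assignment[abs(l)] if l > 0 else 1 - assignment[abs(l)]) == 1
                for l in clause)
            for clause in clauses
        ):
            return assignment
    return None


# Illustrative instances (assumptions, not taken from the text).
satisfiable = [[1, 2, 3], [-1, 2], [-2, 3]]
unsatisfiable = [[1, 2, 3], [-1, 2], [-2, 3], [-3, 1], [-1, -2, -3]]
print(find_certificate(satisfiable, 3))    # {1: 0, 2: 0, 3: 1}
print(find_certificate(unsatisfiable, 3))  # None
```

The loop body is cheap, but the loop may run $2^n$ times, which is the source of the exponential running time.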

8.2.1  Complements  of  Languages  and  Decision  Problems 

There are many ways to encode problem instances. For example, for SATISFIABILITY we
might represent $x_i$ as $i$ and $\bar{x}_i$ as ~$i$ and then use the standard seven-bit ASCII encodings for
characters. Then we would translate the clause $\{x_4, \bar{x}_7\}$ into {4, ~7} and then represent it as
123 052 044 126 055 125, where each number is a decimal representing a binary 7-tuple and
4, comma, and ~ are represented by 052, 044, and 126, respectively, for example.
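
This encoding can be reproduced with a few lines of Python. The function name `encode_clause` is an assumption, and the clause is written without a space after the comma so that it matches the six codes listed above.

```python
def encode_clause(clause_text):
    """Encode a clause string in seven-bit ASCII.

    Returns the codes as three-digit decimals and as one binary string
    with seven bits per character.
    """
    codes = [ord(ch) for ch in clause_text]
    if any(code > 127 for code in codes):
        raise ValueError("not a seven-bit ASCII string")
    decimal = " ".join(f"{code:03d}" for code in codes)
    bits = "".join(f"{code:07b}" for code in codes)
    return decimal, bits


decimal, bits = encode_clause("{4,~7}")
print(decimal)    # 123 052 044 126 055 125
print(bits[:7])   # 1111011, the seven-bit code for "{"
print(len(bits))  # 42
```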

All the instances $I$ of decision problems $\mathcal{P}$ considered in this chapter are characterized
by regular expressions. In addition, the encoding function $\sigma$ of Definition 8.2.1 can be chosen
to map strings in $I$ to binary strings $\sigma(I)$ describable by regular expressions. Thus, a
finite-state machine can be used to determine if a binary string is in $\sigma(I)$ or not. We assume that
membership of a string in $\sigma(I)$ can be determined efficiently.

As suggested by Fig. 8.1, the strings in $\overline{L(\mathcal{P})}$, the complement of $L(\mathcal{P})$, are either strings
in $L(\mathrm{co}\mathcal{P})$ or strings in $\sigma(\Sigma^* - I)$. Since testing of membership in $\sigma(\Sigma^* - I)$ is easy, testing
for membership in $\overline{L(\mathcal{P})}$ and $L(\mathrm{co}\mathcal{P})$ requires about the same space and time. For this reason,
we often equate the two when discussing the complements of languages.

Figure 8.1 The language $L(\mathcal{P})$ of a decision problem $\mathcal{P}$ and the language of its complement
$L(\mathrm{co}\mathcal{P})$. The languages $L(\mathcal{P})$ and $L(\mathrm{co}\mathcal{P})$ encode all instances of $I$. The complement of $L(\mathcal{P})$,
$\overline{L(\mathcal{P})}$, is the union of $L(\mathrm{co}\mathcal{P})$ with $\sigma(\Sigma^* - I)$, strings that are in neither $L(\mathcal{P})$ nor $L(\mathrm{co}\mathcal{P})$.


8.3  Resource  Bounds 


One of the most important problems in computer science is the identification of the
computationally feasible problems. Currently a problem is considered feasible if its running time on a
DTM (deterministic Turing machine) is polynomial. (Stated by Edmonds [96], this is known
as the serial computation thesis.) Note, however, that some polynomial running times, such
as $n^{1000}$, where $n$ is the length of a problem instance, can be enormous. In this case doubling
$n$ increases the time bound by a factor of $2^{1000}$, which is approximately $10^{301}$!
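
The size of this factor is easy to confirm with exact integer arithmetic. The helper name `growth_factor` and the sample value $n = 10$ are assumptions made for the illustration.

```python
def growth_factor(k, n=10):
    """Exact factor by which n**k grows when n is doubled (always 2**k)."""
    return (2 * n) ** k // n ** k


factor = growth_factor(1000)
print(factor == 2 ** 1000)  # True
print(len(str(factor)))     # 302 decimal digits, so the factor is about 10**301
```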

Since problems are classified by their use of resources, we need to be precise about resource
bounds. These are functions $r : \mathbb{N} \mapsto \mathbb{N}$ from the natural numbers $\mathbb{N} = \{0, 1, 2, 3, \ldots\}$ to
the natural numbers. The resource functions used in this chapter are:


Logarithmic function         $r(n) = O(\log n)$
Poly-logarithmic function    $r(n) = \log^{O(1)} n$
Linear function              $r(n) = O(n)$
Polynomial function          $r(n) = n^{O(1)}$
Exponential function         $r(n) = 2^{n^{O(1)}}$


A resource function that grows faster than any polynomial is called a superpolynomial
function. For example, the function $f(n) = 2^{\log^2 n}$ grows faster than any polynomial (the ratio
$\log f(n) / \log n$ is unbounded) but more slowly than any exponential (for any $k > 0$ the ratio
$(\log^2 n)/n^k$ becomes vanishingly small with increasing $n$).
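
Both comparisons can be made explicit for $f(n) = 2^{\log^2 n}$ by taking logarithms. We assume base-2 logarithms here, which affects only constant factors. For any fixed $c > 0$ and $k > 0$:

```latex
\[
\frac{\log f(n)}{\log n^{c}} = \frac{\log^{2} n}{c \log n} = \frac{\log n}{c} \longrightarrow \infty,
\qquad
\frac{\log f(n)}{\log 2^{n^{k}}} = \frac{\log^{2} n}{n^{k}} \longrightarrow 0
\qquad (n \to \infty).
\]
```

The first limit shows that $f(n)$ eventually exceeds every polynomial $n^c$; the second shows that it eventually falls below every exponential $2^{n^k}$.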

Another note of caution is appropriate here when comparing resource functions. Even
though one function, $r(n)$, may grow more slowly asymptotically than another, $s(n)$, it may
still be true that $r(n) > s(n)$ for very large values of $n$. For example, $r(n) = 10 \log^4 n >
s(n) = n$ for $n < 1{,}889{,}750$ despite the fact that $r(n)$ is much smaller than $s(n)$ for large $n$.
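
A quick numerical check of this example is sketched below. It assumes base-2 logarithms, a choice that is consistent with the crossover point quoted above.

```python
import math


def r(n):
    """The poly-logarithmic function 10 * (log2 n)**4."""
    return 10 * math.log2(n) ** 4


def s(n):
    """The linear function n."""
    return n


# r exceeds s for moderately large n but falls behind eventually.
for n in (10**3, 10**6, 10**7, 10**9):
    print(n, r(n) > s(n))
```

With these definitions the comparison holds at $n = 10^3$ and $n = 10^6$ and fails at $n = 10^7$ and $n = 10^9$.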

Some  resource  functions  are  so  complex  that  they  cannot  be  computed  in  the  time  or  space 
that  they  define.  For  this  reason  we  assume  throughout  this  chapter  that  all  resource  functions 
are  proper.  (Definitions  of  time  and  space  on  Turing  machines  are  given  in  Section  8.4.2.) 


DEFINITION 8.3.1 A function $r : \mathbb{N} \mapsto \mathbb{N}$ is proper if it is nondecreasing ($r(n + 1) \geq r(n)$)
and for some tape symbol $a$ there is a deterministic multi-tape Turing machine $M$ that, on all
inputs of length $n$ in time $O(n + r(n))$ and temporary space $r(n)$, writes the string $a^{r(n)}$ (unary
notation for $r(n)$) on one of its tapes and halts.

Thus, if a resource function $r(n)$ is proper, there is a DTM, $M_r$, that given an input of length
$n$ can write $r(n)$ markers on one of its tapes within time $O(n + r(n))$ and space $r(n)$. Another
DTM, $M$, can use a copy of $M_r$ to mark $r(n)$ squares on a tape that can be used to stop $M$
after exactly $Kr(n)$ steps for some constant $K$. The resource function can also be used to
ensure that $M$ uses no more than $Kr(n)$ cells on its work tapes.
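
The role of the marked squares as a clock can be suggested by a small Python sketch. It is an analogy on a random-access model, not a Turing machine construction; the names `run_clocked` and `countdown` are assumptions, and the sketch stops after at most $Kr(n)$ steps rather than exactly that many.

```python
def run_clocked(step, state, r, n, K=1):
    """Run step(state) until it halts or K * r(n) steps have elapsed.

    step: function mapping a state to (new_state, halted).
    r:    resource function; K * r(n) plays the role of the marked squares.
    Returns (final_state, halted, steps_used).
    """
    budget = K * r(n)
    steps = 0
    halted = False
    while steps < budget and not halted:
        state, halted = step(state)
        steps += 1
    return state, halted, steps


def countdown(state):
    """Toy machine: decrement a counter and halt when it reaches zero."""
    state -= 1
    return state, state == 0


print(run_clocked(countdown, 5, lambda n: n * n, n=3))   # (0, True, 5)
print(run_clocked(countdown, 20, lambda n: n * n, n=3))  # (11, False, 9)
```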

8.4  Serial  Computational  Models 

We  consider  two  serial  computational  models  in  this  chapter,  the  random-access  machine 
(RAM)  introduced  in  Section  3.4  and  the  Turing  machine  defined  in  Chapter  5. 

In  this  section  we  show  that,  up  to  polynomial  differences  in  running  time,  the  random- 
access  and  Turing  machines  are  equivalent.  As  a  consequence,  if  the  running  time  of  a  problem 
on  one  machine  grows  at  least  as  fast  as  a  polynomial  in  the  length  of  a  problem  instance,  then 
it grows equally fast on the other machine. This justifies using the Turing machine as the basis for
classifying  problems  by  their  serial  complexity. 

In Sections 8.13 and 8.14 we examine two parallel models of computation, the logic circuit
and  the  parallel  random-access  machine  (PRAM). 

Before  beginning  our  discussion  of  models,  we  note  that  any  model  can  be  considered 
either  serial  or  parallel.  For  example,  a  finite-state  machine  operating  on  inputs  and  states 
represented  by  many  bits  is  a  parallel  machine.  On  the  other  hand,  a  PRAM  that  uses  one 
simple  RAM  processor  is  serial. 

8.4.1  The  Random-Access  Machine 

The random-access machine (RAM) is introduced in Section 3.4. (See Fig. 8.2.) In this section
we generalize the simulation results developed in Section 3.7 by considering a RAM in which
words are of potentially unbounded length. This RAM is assumed to have instructions for
addition, subtraction, shifting left and right by one place, comparison of words, and Boolean
operations of AND, OR, and NOT (the operations are performed on corresponding components
of the source vectors), as well as conditional and unconditional jump instructions. The RAM
also has load (and store) instructions that move words to (from) registers from (to) the
random-access memory. Immediate and direct addressing are allowed. An immediate address contains
a value, a direct address is the address of a value, and an indirect address is the address of
the address of a value. (As explained in Section 3.10 and stated in Problem 3.10, indirect
addressing does not add to the computing power of the RAM and is considered only in the
problems.)

Figure 8.2 A RAM in which the number and length of words are potentially unbounded.

The  time  on  a  RAM  is  the  number  of  steps  it  executes.  The  space  is  the  maximum  number 
of  bits  of  storage  used  either  in  the  CPU  or  the  random-access  memory  during  a  computation. 

We  simplify  the  RAM  without  changing  its  nature  by  eliminating  its  registers,  treating 
location  0  of  the  random-access  memory  as  the  accumulator,  and  using  memory  locations  as 
registers.  The  RAM  retains  its  program  counter,  which  is  incremented  on  each  instruction 
execution  (except  for  a  jump  instruction,  when  its  value  is  set  to  the  address  supplied  by  the 
jump  instruction).  The  word  length  of  the  RAM  model  is  typically  allowed  to  be  unlimited, 
although in Section 3.4 we limited it to $b$ bits. A RAM program is a finite sequence of RAM
instructions  that  is  stored  in  the  random-access  memory.  The  RAM  implements  the  stored- 
program  concept  described  in  Section  3.4. 

In Theorem 3.8.1 we showed that a $b$-bit standard Turing machine (its tape alphabet
contains $2^b$ characters) executing $T$ steps and using $S$ bits of storage ($S/b$ words) can be simulated
by the RAM described above in $O(T)$ steps with $O(S)$ bits of storage. Similarly, we showed
that a $b$-bit RAM executing $T$ steps and using $S$ bits of memory can be simulated by an
$O(b)$-bit standard Turing machine in $O(ST \log^2 S)$ steps and $O(S \log S)$ bits of storage. As seen
in Section 5.2, $T$-step computations on a multi-tape TM can be simulated in $O(T^2)$ steps on
a standard Turing machine.

If we could ensure that a RAM that executes $T$ steps uses a highest address that is $O(T)$ and
generates words of fixed length, then we could use the above-mentioned simulation to establish
that a standard Turing machine can simulate an arbitrary $T$-step RAM computation in time
$O(T^2 \log^2 T)$ and space $O(S \log S)$ measured in bits. Unfortunately, words can have length
proportional to $T$ (see Problem 8.4) and the highest address can be much larger than $T$ due
to the use of jumps. Nonetheless, a reasonably efficient polynomial-time simulation of a RAM
computation by a DTM can be produced. Such a DTM places one (address, contents)
pair on its tape for each RAM memory location visited by the RAM. (See Problem 8.5.)
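
The idea of storing only the visited locations can be sketched in Python with a list of pairs that is scanned linearly, much as a tape would be. The class name `PairTape` and the convention that unvisited locations hold 0 are assumptions made for illustration.

```python
class PairTape:
    """RAM memory kept as a list of (address, contents) pairs.

    Storage grows with the number of locations visited, not with the
    largest address used. Lookups scan the list from the start.
    """

    def __init__(self):
        self.pairs = []

    def read(self, address):
        for a, contents in self.pairs:
            if a == address:
                return contents
        return 0  # assumption: unvisited locations hold 0

    def write(self, address, contents):
        for i, (a, _) in enumerate(self.pairs):
            if a == address:
                self.pairs[i] = (address, contents)
                return
        self.pairs.append((address, contents))


tape = PairTape()
tape.write(10**9, 7)  # a very large address costs one pair
tape.write(3, 42)
tape.write(10**9, 8)  # overwriting reuses the existing pair
print(tape.read(10**9), tape.read(3), tape.read(5), len(tape.pairs))  # 8 42 0 2
```

Each access costs time proportional to the number of pairs stored, which is the kind of overhead that makes the simulation polynomial rather than linear.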

We  leave  the  proof  of  the  following  result  to  the  reader.  (See  Problem  8.6.) 

THEOREM 8.4.1 Every computation on the RAM using time $T$ can be simulated by a deterministic
Turing machine in $O(T^3)$ steps.

In  light  of  the  above  results  and  since  we  are  generally  interested  in  problems  whose  time 
is  polynomial  in  the  length  of  the  input,  we  use  the  DTM  as  our  model  of  serial  computation. 

8.4.2  Turing  Machine  Models 

The  deterministic  and  nondeterministic  Turing  machines  (DTM  and  NDTM)  are  discussed 
in  Sections  3.7,  5.1,  and  5.2.  (See  Fig.  8.3.)  In  this  chapter  we  use  multi-tape  Turing  machines 
to define classes of problems characterized by their use of time and space. As shown in Theorem
5.2.2, the general language-recognition capability of DTMs and NDTMs is the same,
although, as we shall see, their ability to recognize languages within the same resource bounds
is very different.

Figure 8.3 A one-tape nondeterministic Turing machine whose control unit has an external
choice input that disambiguates the value of its next state.

We  recognize  two  types  of  Turing  machine,  the  standard  one-tape  DTM  and  NDTM  and 
the multi-tape DTM and NDTM. The multi-tape versions are defined here to have one read-only
input tape, one write-only output tape, and one or more work tapes. The space on these
machines  is  defined  to  be  the  number  of  work  tape  cells  used  during  a  computation.  This 
measure  allows  us  to  classify  problems  by  a  storage  that  may  be  less  than  linear  in  the  size  of 
the  input.  Time  is  the  number  of  steps  they  execute.  It  is  interesting  to  compare  these  measures 
with those for the RAM. (See Problem 8.7.) As shown in Section 5.2, we can assume without
loss  of  generality  that  each  NDTM  has  either  one  or  two  choices  for  next  state  for  any  given 
input  letters  and  state. 

As stated in Definitions 3.7.1 and 5.1.1, a DTM M accepts the language L if and only if
for each string in L placed left-adjusted on the otherwise blank input tape it eventually enters
the accepting halt state. A language accepted by a DTM M is recursive if M halts on all
inputs. Otherwise it is recursively enumerable. A DTM M computes a partial function f
if for each input string w for which f is defined, it prints f(w) left-adjusted on its otherwise
blank output tape. A complete function is one that is defined on all points of its domain.

As stated in Definition 5.2.1, an NDTM M accepts the language L if for each string w in
L placed left-adjusted on the otherwise blank input tape there is a choice input c for M that
leads to an accepting halt state. An NDTM M computes a partial function $f : B^* \mapsto B^*$ if
for each input string w for which f is defined, there is a sequence of moves by M that causes
it to print f(w) on its output tape and enter a halt state and there is no choice input for which
M prints an incorrect result.

The  oracle  Turing  machine  (OTM),  the  multi-tape  DTM  or  NDTM  with  a  special  oracle 
tape,  defined  in  Section  5.2.3,  is  used  to  classify  problems.  (See  Problem  8.15.)  Time  on  an 
OTM  is  the  number  of  steps  it  takes,  where  one  consultation  of  the  oracle  is  one  step,  whereas 
space is the number of cells used on its work tapes not including the oracle tape.

A precise Turing machine M is a multi-tape DTM or NDTM for which there is a function
r(n) such that for every $n \geq 1$, every input w of length n, and every (possibly
nondeterministic) computation by M, M halts after precisely r(n) steps.

We  now  show  that  if  a  total  function  can  be  computed  by  a  DTM,  NDTM,  or  OTM 
within  a  proper  time  or  space  bound,  it  can  be  computed  within  approximately  the  same 
resource  bound  by  a  precise  TM  of  the  same  type.  The  following  theorem  justifies  the  use  of 
proper  resource  functions. 

THEOREM 8.4.2 Let r(n) be a proper function with $r(n) \geq n$. Let M be a multi-tape DTM,
NDTM, or OTM with k work tapes that computes a total function f in time or space r(n). Then
there is a constant K > 0 and a precise Turing machine of the same type that computes f in time
and space Kr(n).

Proof Since r(n) is a proper function, there is a DTM $M_r$ that computes its value from an
input of length n in time $K_1 r(n)$ for some constant $K_1 > 0$ and in space r(n). We design
a precise TM $M_p$ computing the same function.

The TM $M_p$ has an “enumeration tape” that is distinct from its work tapes. $M_p$ initially
invokes $M_r$ to write r(n) instances of the letter a on the enumeration tape in $K_1 r(n)$ steps,
after which it returns the head on this tape to its initial position.

Suppose that M computes f within a time bound of r(n). $M_p$ then alternates between
simulating one step of M on its work tapes and advancing its head on the enumeration
tape. When M halts, $M_p$ continues to read and advance the head on its enumeration tape
on alternate steps until it encounters a blank. Clearly, $M_p$ halts in precisely $(K_1 + 2)r(n)$
steps.

Suppose now that M computes f in space r(n). $M_p$ invokes $M_r$ to write r(n) special
blank symbols on each of its work tapes. It then simulates M, treating the special blank
symbols as standard blanks. Thus, $M_p$ uses precisely kr(n) cells on its k work tapes. ■
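The clocking idea used for the time-bounded case can be illustrated with a small Python sketch. It is a loose analogy rather than a Turing machine construction: `run_precise`, `count_a`, and `r` are hypothetical names, a "machine" is assumed to be a function `step` that maps a state to the next state or to `None` once it has halted, and the integer `budget` stands in for the marks written on the enumeration tape.

```python
def run_precise(step, state, budget):
    """Run `step` until it halts, then idle, so that exactly `budget` ticks elapse."""
    ticks = 0
    halted = False
    while ticks < budget:          # one pass per mark on the "enumeration tape"
        if not halted:
            nxt = step(state)
            if nxt is None:        # the simulated machine has halted
                halted = True
            else:
                state = nxt
        ticks += 1                 # advance the clock whether or not work was done
    if not halted:
        raise RuntimeError("budget too small for this input")
    return state, ticks


def count_a(w):
    """Toy machine: scan w once and count occurrences of the letter 'a'."""
    def step(state):
        pos, count = state
        if pos == len(w):
            return None
        return (pos + 1, count + (w[pos] == "a"))
    return step


def r(n):
    """Assumed resource bound for the toy machine."""
    return 2 * n + 1
```

For every input w of length n, `run_precise(count_a(w), (0, 0), r(len(w)))` uses the same number of ticks, which is the defining property of a precise machine.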

Configuration  graphs,  defined  in  Section  5.3,  are  graphs  that  capture  the  state  of  Turing 
machines  with  potentially  unlimited  storage  capacity.  Since  all  resource  bounds  are  proper,  as 
we  know  from  Theorem  8.4.2,  all  DTMs  and  NDTMs  used  for  decision  problems  halt  on  all 
inputs.  Furthermore,  NDTMs  never  give  an  incorrect  answer.  Thus,  configuration  graphs  can 
be  assumed  to  be  acyclic. 

8.5  Classification  of  Decision  Problems 

In  this  section  we  classify  decision  problems  by  the  resources  they  consume  on  deterministic 
and  nondeterministic  Turing  machines.  We  begin  with  the  definition  of  complexity  classes. 

DEFINITION 8.5.1 Let $r(n) : \mathbb{N} \mapsto \mathbb{N}$ be a proper resource function. Then TIME(r(n)) and
SPACE(r(n))  are  the  time  and  space  Turing  complexity  classes  containing  languages  that 
can  be  recognized  by  DTMs  that  halt  on  all  inputs  in  time  and  space  r(n),  respectively,  where  n  is 
the  length  of  an  input.  NTIME(r(n))  and  NSPACE(r(n))  are  the  nondeterministic  time 
and  space  Turing  complexity  classes,  respectively,  defined  for  NDTMs  instead  of  DTMs.  The 
union  of  complexity  classes  is  also  a  complexity  class. 

Let k be a positive integer. Then TIME($k^n$) and NSPACE($n^k$) are examples of complexity
classes. They are the decision problems solvable in deterministic time $k^n$ and nondeterministic
space $n^k$, respectively, for n the length of the input. Since time and space on a Turing machine
are measured by the number of steps and number of tape cells, it is straightforward to show
that time and space for a given Turing machine, deterministic or not, can each be reduced by
a constant factor by modifying the Turing machine description so that it acts on larger units
of information. (See Problem 8.8.) Thus, for a constant K > 0 the following classes are the
same: a) TIME($k^n$) and TIME($Kk^n$), b) NTIME($k^n$) and NTIME($Kk^n$), c) SPACE($n^k$)
and SPACE($Kn^k$), and d) NSPACE($n^k$) and NSPACE($Kn^k$).

To  emphasize  that  the  union  of  complexity  classes  is  another  complexity  class,  we  define 
as  unions  two  of  the  most  important  Turing  complexity  classes,  P,  the  class  of  deterministic 
polynomial-time  decision  problems,  and  NP,  the  class  of  nondeterministic  polynomial-time 
decision  problems. 

DEFINITION  8.5.2  The  classes  P  and  NP  are  sets  of  decision  problems  solvable  in  polynomial  time 
on  DTMs  and  NDTMs,  respectively;  that  is,  they  are  defined  as  follows: 

$$\mathrm{P} = \bigcup_{k \geq 0} \mathrm{TIME}(n^k)$$

$$\mathrm{NP} = \bigcup_{k \geq 0} \mathrm{NTIME}(n^k)$$

Thus, for each decision problem $\mathcal{P}$ in P there is a DTM M and a polynomial p(n) such
that M halts on each input string of length n in p(n) steps, accepting this string if it is an
instance w of $\mathcal{P}$ and rejecting it otherwise.

Also, for each decision problem $\mathcal{P}$ in NP there is an NDTM M and a polynomial p(n)
such that for each instance w of $\mathcal{P}$, |w| = n, there is a choice input of length p(n) such that
M accepts w in p(n) steps.

Problems in P are considered feasible because they can be decided in time polynomial
in the length of their input, even though some polynomial functions, such as $n^{1000}$,
grow very rapidly in their one parameter. Problems that require exponential time are not
considered feasible.

The  class  NP  includes  the  decision  problems  associated  with  many  hundreds  of  important 
searching  and  optimization  problems,  such  as  TRAVELING  SALESPERSON  described  below. 
(See Fig. 8.4.) If P is equal to NP, then these important problems have feasible solutions. If
not, then there are problems in NP that require superpolynomial time and are therefore largely
infeasible. Thus, it is very important to have the answer to the question $\mathrm{P} \stackrel{?}{=} \mathrm{NP}$.

TRAVELING  SALESPERSON 

Instance: An integer k and a set of $n^2$ symmetric integer distances $\{d_{i,j} \mid 1 \leq i, j \leq n\}$
between n cities where $d_{i,j} = d_{j,i}$.

Answer: “Yes” if there is a tour (an ordering) $\{i_1, i_2, \ldots, i_n\}$ of the cities such that the
length $l = d_{i_1,i_2} + d_{i_2,i_3} + \cdots + d_{i_n,i_1}$ of the tour satisfies $l \leq k$.

Figure 8.4 A graph on which the TRAVELING SALESPERSON problem is defined. The heavy
edges identify a shortest tour.

The TRAVELING SALESPERSON problem is in NP because a tour satisfying $l \leq k$ can
be chosen nondeterministically in n steps and the condition $l \leq k$ then verified in a polynomial
number of steps by finding the distances between successive cities on the chosen tour in
the description of the problem and adding them together. (See Problem 3.24.) Many other
important problems are in NP, as we see in Section 8.10. While it is unknown whether a
deterministic polynomial-time algorithm exists for this problem, it can clearly be solved
deterministically in exponential time by enumerating all tours and choosing the one with smallest
length. (See Problem 8.9.)
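The two halves of this argument, polynomial-time verification of a proposed tour and exponential-time enumeration of all tours, can be sketched in Python. The function names are hypothetical, cities are assumed to be numbered 0 to n − 1, the distances are assumed to be given as an n × n list of lists with n at least 1, and the enumeration is a plain brute-force search, not a suggested solution to Problem 8.9.

```python
from itertools import permutations


def tour_length(d, tour):
    """Sum the distances between successive cities, returning to the start."""
    n = len(tour)
    return sum(d[tour[i]][tour[(i + 1) % n]] for i in range(n))


def verify_tour(d, k, tour):
    """Polynomial-time check: is `tour` an ordering of all cities of length <= k?"""
    n = len(d)
    return sorted(tour) == list(range(n)) and tour_length(d, tour) <= k


def tsp_decision_bruteforce(d, k):
    """Exponential-time decision: try every tour that starts at city 0."""
    n = len(d)
    return any(
        tour_length(d, (0,) + rest) <= k
        for rest in permutations(range(1, n))
    )


d = [
    [0, 1, 4, 2],
    [1, 0, 3, 5],
    [4, 3, 0, 6],
    [2, 5, 6, 0],
]
print(verify_tour(d, 12, [0, 1, 2, 3]))   # the tour 0, 1, 2, 3 has length 1 + 3 + 6 + 2
print(tsp_decision_bruteforce(d, 11))
```

Fixing the first city is safe because a tour is a cycle, so only (n − 1)! orderings need to be examined.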

The TRAVELING SALESPERSON decision problem is a reduction of the traveling salesperson
optimization problem, whose goal is to find the shortest tour that visits each city
once. The output of the optimization problem is an ordering of the cities that has the shortest
tour. By contrast, the TRAVELING SALESPERSON decision problem reports that there is
or is not a tour of length k or less. Given an algorithm for the optimization problem, the
decision problem can be solved by calculating the length of an optimal tour and comparing
it to the parameter k of the decision problem. Since the latter steps can be done in polynomial
time, if the optimization algorithm can be done in polynomial time, so can the decision
problem. On the other hand, given an algorithm for the decision problem, the optimization
problem can be solved through bisection as follows: a) Since the length of the shortest tour
is in the interval $[n \min_{i,j} d_{i,j},\; n \max_{i,j} d_{i,j}]$, invoke the decision algorithm with k equal to
the midpoint of this interval. b) If the instance is a “yes” instance, let k be the midpoint
of the lower half of the current interval; if not, let it be the midpoint of the upper half. c)
Repeat the previous step until the interval is reduced to one integer. The interval is bisected
$O(\log n(\max_{i,j} d_{i,j} - \min_{i,j} d_{i,j}))$ times. Thus, if the decision problem can be solved in
polynomial time, so can the optimization problem.
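The bisection in steps a) to c) can be sketched as follows. The names `decide` and `optimal_tour_length` are hypothetical, the decision procedure shown is an exponential-time stand-in used only so that the example runs, and the interval endpoints are computed from the off-diagonal distances, which is an assumption about how the instance is stored. The sketch assumes at least two cities and returns only the length of a shortest tour, not the tour itself.

```python
from itertools import permutations


def decide(d, k):
    """Stand-in decision oracle: is there a tour of length at most k?"""
    n = len(d)
    tours = ((0,) + rest for rest in permutations(range(1, n)))
    return any(
        sum(d[t[i]][t[(i + 1) % n]] for i in range(n)) <= k
        for t in tours
    )


def optimal_tour_length(d, decide=decide):
    """Find the shortest tour length by bisection on the threshold k."""
    n = len(d)
    entries = [d[i][j] for i in range(n) for j in range(n) if i != j]
    lo, hi = n * min(entries), n * max(entries)   # the optimum lies in [lo, hi]
    while lo < hi:
        mid = (lo + hi) // 2
        if decide(d, mid):
            hi = mid          # a tour of length <= mid exists: keep the lower half
        else:
            lo = mid + 1      # no such tour: the optimum is in the upper half
    return lo


d = [
    [0, 1, 4, 2],
    [1, 0, 3, 5],
    [4, 3, 0, 6],
    [2, 5, 6, 0],
]
print(optimal_tour_length(d))
```

Each pass through the loop makes one call to the decision procedure and halves the interval, so the number of calls is logarithmic in the width of the starting interval.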

Whether  P  =  NP  is  one  of  the  outstanding  problems  of  computer  science.  The  current 
consensus of complexity theorists is that nondeterminism is such a powerful specification
device that they are not equal. We return to this topic in Section 8.8.

8.5.1  Space  and  Time  Hierarchies 

In  this  section  we  state  without  proof  the  following  time  and  space  hierarchy  theorems.  (See 
[127,128].)  These  theorems  state  that  if  one  space  (or  time)  resource  bound  grows  sufficiently 
rapidly  relative  to  another,  the  set  of  languages  recognized  within  the  first  bound  is  strictly 
larger  than  the  set  recognized  within  the  second  bound. 

THEOREM 8.5.1 (Time Hierarchy Theorem) If $r(n) \geq n$ is a proper complexity function,
then TIME(r(n)) is strictly contained in TIME(r(n) log r(n)).


Let r(n) and s(n) be proper functions. If for all K > 0 there exists an $N_0$ such that
$s(n) \geq Kr(n)$ for $n \geq N_0$, we say that r(n) is little oh of s(n) and write r(n) = o(s(n)).

THEOREM 8.5.2 (Space Hierarchy Theorem) If r(n) and s(n) are proper complexity functions
and r(n) = o(s(n)), then SPACE(r(n)) is strictly contained in SPACE(s(n)).

Theorem 8.5.3 states that there is a recursive but not proper resource function r(n) such
that TIME(r(n)) and TIME($2^{r(n)}$) are the same. That is, for some function r(n) there is a
gap of at least $2^{r(n)} - r(n)$ in time over which no new decision problems are encountered.
This is a weakened version of a stronger result in [334] and independently reported by [51].

THEOREM 8.5.3 (Gap Theorem) There is a recursive function $r(n) : B^* \mapsto B^*$ such that
TIME(r(n)) = TIME($2^{r(n)}$).

8.5.2  Time-Bounded  Complexity  Classes 

As  mentioned  earlier,  decision  problems  in  P  are  considered  to  be  feasible  while  the  class 
NP  includes  many  interesting  problems,  such  as  the  TRAVELING  SALESPERSON  problem, 
whose  feasibility  is  unknown.  Two  other  important  complexity  classes  are  the  deterministic 
and  nondeterministic  exponential-time  problems.  By  the  remarks  on  page  336,  TRAVELING 
SALESPERSON  clearly  falls  into  the  latter  class. 

DEFINITION 8.5.3 The classes EXPTIME and NEXPTIME consist of those decision problems
solvable in deterministic and nondeterministic exponential time, respectively, on a Turing machine.
That is,

$$\mathrm{EXPTIME} = \bigcup_{k \geq 0} \mathrm{TIME}(2^{n^k})$$

$$\mathrm{NEXPTIME} = \bigcup_{k \geq 0} \mathrm{NTIME}(2^{n^k})$$

We make the following observations concerning containment of these complexity classes.

THEOREM 8.5.4 The following complexity class containments hold:

$$\mathrm{P} \subseteq \mathrm{NP} \subseteq \mathrm{EXPTIME} \subseteq \mathrm{NEXPTIME}$$

However, P $\subset$ EXPTIME, that is, P is strictly contained in EXPTIME.

Proof Since languages in P are recognized in polynomial time by a DTM and such machines
are included among the NDTMs, it follows immediately that P $\subseteq$ NP. By similar reasoning,
EXPTIME $\subseteq$ NEXPTIME.

We now show that P is strictly contained in EXPTIME. P $\subseteq$ TIME($2^n$) follows because
TIME($n^k$) $\subseteq$ TIME($2^n$) for each $k \geq 0$. By the Time Hierarchy Theorem (Theorem
8.5.1), we have that TIME($2^n$) $\subset$ TIME($n 2^n$). But TIME($n 2^n$) $\subseteq$ EXPTIME.
Thus, P is strictly contained in EXPTIME.

Containment of NP in EXPTIME is deduced from the proof of Theorem 5.2.2 by
analyzing the time taken by the deterministic simulation of an NDTM. If the NDTM
executes T steps, the DTM executes $O(k^T)$ steps for some constant k. ■

The relationships P $\subseteq$ NP and EXPTIME $\subseteq$ NEXPTIME are examples of a more general
result, namely, TIME(r(n)) $\subseteq$ NTIME(r(n)), where these two classes of decision problems
can respectively be solved deterministically and nondeterministically in time r(n), where n
is the length of the input. This result holds because every instance of length n of a problem
$\mathcal{P} \in$ TIME(r(n)) is accepted in r(n) steps by some DTM $M_{\mathcal{P}}$ and a DTM is also an NDTM.
Thus, it is also true that $\mathcal{P} \in$ NTIME(r(n)).

8.5.3  Space-Bounded  Complexity  Classes 

Many other important space complexity classes are defined by the amount of space used to
recognize languages and compute functions. We highlight five of them here: the deterministic
and nondeterministic logarithmic space classes L and NL, the square-logarithmic space
class $\mathrm{L}^2$, and the deterministic and nondeterministic polynomial-space classes PSPACE and
NPSPACE.

DEFINITION 8.5.4 L and NL are the decision problems solvable in logarithmic space on a DTM
and NDTM, respectively. $\mathrm{L}^2$ is the class of decision problems solvable in space $O(\log^2 n)$ on a DTM.
PSPACE and NPSPACE are the decision problems solvable in polynomial space on a DTM and
NDTM, respectively.

Because L and PSPACE are deterministic complexity classes, they are contained in NL and
NPSPACE, respectively: that is, L $\subseteq$ NL and PSPACE $\subseteq$ NPSPACE.

We now strengthen the latter result and show that PSPACE = NPSPACE, which means
that nondeterminism does not increase the recognition power of Turing machines if they
already have access to a polynomial amount of storage space.

The REACHABILITY problem on directed acyclic graphs defined below is used to show this
result. REACHABILITY is applied to configuration graphs of deterministic and nondeterministic
Turing machines. Configuration graphs are introduced in Section 5.3.

REACHABILITY 

Instance: A directed graph G = (V, E) and a pair of vertices $u, v \in V$.

Answer:  “Yes”  if  there  is  a  directed  path  in  G  from  u  to  v. 

REACHABILITY  can  be  decided  by  computing  the  transitive  closure  of  the  adjacency  matrix 
of G in parallel. (See Section 6.4.) However, a simple serial RAM program based on depth-first
search can also solve the reachability problem. Depth-first search (DFS) on an undirected
graph  G  visits  each  edge  in  the  forward  direction  once.  Edges  at  each  vertex  are  ordered.  Each 
time  DFS  arrives  at  a  vertex  it  traverses  the  next  unvisited  edge.  If  DFS  arrives  at  a  vertex  from 
which  there  are  no  unvisited  edges,  it  retreats  to  the  previously  visited  vertex.  Thus,  after  DFS 
visits  all  the  descendants  of  a  vertex,  it  backs  up,  eventually  returning  to  the  vertex  from  which 
the  search  began. 
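A depth-first search for REACHABILITY can be sketched in Python as follows. This is an illustrative sketch rather than the RAM program referred to above: the name `reachable` is hypothetical, and the directed graph is assumed to be stored as a dictionary that maps each vertex to an ordered list of its successors.

```python
def reachable(adj, u, v, visited=None):
    """Return True if there is a directed path from u to v in the graph adj."""
    if visited is None:
        visited = set()
    if u == v:
        return True
    visited.add(u)                      # one mark per vertex: linear space
    for b in adj.get(u, ()):            # edges at a vertex are taken in order
        if b not in visited and reachable(adj, b, v, visited):
            return True
    return False                        # no unvisited edge left: retreat


adj = {0: [1, 2], 1: [3], 2: [], 3: [1], 4: [0]}
print(reachable(adj, 0, 3))
print(reachable(adj, 3, 0))
```

The `visited` set and the recursion stack can each hold every vertex of the graph, which is the linear space cost that the next result avoids.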

Since every T-step RAM computation can be simulated by an $O(T^3)$-step DTM computation
(see Problem 8.6), a cubic-time DTM program based on DFS exists for REACHABILITY.
Unfortunately, the space to execute DFS on the RAM and Turing machine both can be linear
in the size of the graph. We give an improved result that allows us to strengthen PSPACE $\subseteq$
NPSPACE to PSPACE = NPSPACE.

Below we show that REACHABILITY can be realized in quadratic logarithmic space. This
fact is then used to show that NSPACE(r(n)) $\subseteq$ SPACE($r^2(n)$) for $r(n) = \Omega(\log n)$.

THEOREM 8.5.5 (Savitch) REACHABILITY is in SPACE($\log^2 n$).

Proof As mentioned three paragraphs earlier, the REACHABILITY problem on a graph G =
(V, E) can be solved with depth-first search. This requires storing data on each vertex visited
during a search. This data can be as large as O(n), n = |V|. We exhibit an algorithm that
uses much less space.

Given an instance of REACHABILITY defined by G = (V, E) and $u, v \in V$, for each
pair of vertices (a, b) and integer $k \leq \lceil \log_2 n \rceil$ we define predicates PATH($a, b, 2^k$) whose
value is true if there exists a path from a to b in G whose length is at most $2^k$ and false otherwise.
Since no path has length more than n, the solution to the REACHABILITY problem is
the value of PATH($u, v, 2^{\lceil \log_2 n \rceil}$). The predicates PATH($a, b, 2^0$) are true if either a = b
or there is a path of length 1 (an edge) between the vertices a and b. Thus, PATH($a, b, 2^0$)
can be evaluated directly by consulting the problem instance on the input tape.

The algorithm that computes PATH($u, v, 2^{\lceil \log_2 n \rceil}$) with space $O(\log^2 n)$ uses the
fact that any path of length at most $2^k$ can be decomposed into two paths of length at
most $2^{k-1}$. Thus, if PATH($a, b, 2^k$) is true, then there must be some vertex z such that
PATH($a, z, 2^{k-1}$) and PATH($z, b, 2^{k-1}$) are both true. The truth of PATH($a, b, 2^k$) can
be established by searching for a z such that PATH($a, z, 2^{k-1}$) is true. Upon finding one,
we determine the truth of PATH($z, b, 2^{k-1}$). Failing to find such a z, PATH($a, b, 2^k$) is
declared to be false. Each evaluation of a predicate is done in the same fashion, that is,
recursively. Because we need evaluate only one of PATH($a, z, 2^{k-1}$) and PATH($z, b, 2^{k-1}$)
at a time, space can be reused.

We now describe a deterministic Turing machine with an input tape and two work tapes
computing PATH($u, v, 2^{\lceil \log_2 n \rceil}$). The input tape contains an instance of REACHABILITY,
which means it has not only the vertices u and v but also a description of the graph G. The
first work tape will contain triples of the form (a, b, k), which are called activation records.
This tape is initialized with the activation record $(u, v, \lceil \log_2 n \rceil)$. (See Fig. 8.5.)

The DTM evaluates the last activation record, (a, b, k), on the first work tape as described
above. There are three kinds of activation records, complete records of the form
(a, b, k), initial segments of the form $(a, z, k-1)$, and final segments of the form $(z, b, k-1)$.
The first work tape is initialized with the complete record $(u, v, \lceil \log_2 n \rceil)$.

An initial segment is created from the current complete record (a, b, k) by selecting a
vertex z to form the record $(a, z, k-1)$, which becomes the current complete record. If
it evaluates to true, it can be determined to be an initial or final segment by examining the
previous record (a, b, k). If it evaluates to false, $(a, z, k-1)$ is erased and another value
of z, if any, is selected and another initial segment placed on the work tape for evaluation.
If no other z exists, $(a, z, k-1)$ is erased and the expression PATH($a, b, 2^k$) is declared
false. If $(a, z, k-1)$ evaluates to true, the final record $(z, b, k-1)$ is created, placed on the
work tape, and evaluated in the same fashion. As mentioned in the second paragraph of this
proof, (a, b, 0) is evaluated by consulting the description of the graph on the input tape. The
second work tape is used for bookkeeping, that is, to enumerate values of z and determine
whether a segment is initial or final.

Figure 8.5 A snapshot of the stack used by the REACHABILITY algorithm in which the
components of an activation record (a, b, k) are distributed over several cells.

The second work tape uses space O(log n). The first work tape contains at most
$\lceil \log_2 n \rceil$ activation records. Each activation record (a, b, k) can be stored in O(log n) space
because each vertex can be specified in O(log n) space and the depth parameter k can be
specified in O(log k) = O(log log n) space. It follows that the first work tape uses at most
$O(\log^2 n)$ space. ■
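The recursive evaluation of the PATH predicates in the proof of Theorem 8.5.5 can be sketched in Python. The names `path` and `reachable_savitch` are hypothetical, and the graph is assumed to be given as a list of vertices and a set of directed edges (a, b). The Python call stack plays the role of the first work tape: at most one chain of activation records, one per level of recursion, is live at any time.

```python
def path(vertices, edges, a, b, k):
    """PATH(a, b, 2**k): is there a path from a to b of length at most 2**k?"""
    if k == 0:
        # Base case: consult the problem instance directly.
        return a == b or (a, b) in edges
    for z in vertices:
        # Evaluate the initial segment first, then the final segment;
        # the space used by the first call is released before the second.
        if path(vertices, edges, a, z, k - 1) and path(vertices, edges, z, b, k - 1):
            return True
    return False


def reachable_savitch(vertices, edges, u, v):
    """Decide REACHABILITY by evaluating PATH(u, v, 2**ceil(log2 n))."""
    n = len(vertices)
    k = (n - 1).bit_length()   # equals ceil(log2 n) for n >= 1
    return path(vertices, edges, u, v, k)


vertices = [0, 1, 2, 3, 4]
edges = {(0, 1), (1, 2), (2, 3)}
print(reachable_savitch(vertices, edges, 0, 3))
print(reachable_savitch(vertices, edges, 3, 0))
```

The saving in space is paid for in time: each level of recursion tries every vertex z and makes two recursive calls, so the running time of this sketch grows much faster than that of depth-first search.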

The  following  general  result,  which  is  a  corollary  of  Savitch’s  theorem,  demonstrates  that 
nondeterminism  does  not  enlarge  the  space  complexity  classes  if  they  are  defined  by  space 
bounds  that  are  at  least  logarithmic.  In  particular,  it  implies  that  PSPACE  =  NPSPACE. 

COROLLARY 8.5.1 Let r(n) be a proper Turing computable function $r : \mathbb{N} \mapsto \mathbb{N}$ satisfying
$r(n) = \Omega(\log n)$. Then NSPACE(r(n)) $\subseteq$ SPACE($r^2(n)$).

Proof Let $M_{ND}$ be an NDTM with input and output tapes and s work tapes. Let it recognize
a language $L \in$ NSPACE(r(n)). For each input string w, we generate a configuration
graph $G(M_{ND}, w)$ of $M_{ND}$. (See Fig. 8.6.) We use this graph to determine whether or not
$w \in L$. $M_{ND}$ has at most |Q| states, each tape cell can have at most c values (there are
$c^{(s+2)r(n)}$ configurations for the s + 2 tapes), the s work tape heads and the output tape
head can assume values in the range $1 \leq h_j \leq r(n)$, and the input head $h_{s+1}$ can assume
one of n positions (there are $n\, r(n)^{s+1}$ configurations for the tape heads). It follows that
$M_{ND}$ has at most $|Q|\, c^{(s+2)r(n)} (n\, r(n)^{s+1}) \leq k^{\log n + r(n)}$ configurations for some
constant k. $G(M_{ND}, w)$
has the same number of vertices as there are configurations and a number of edges at most
the square of its number of vertices.

Let $L \in$ NSPACE(r(n)) be recognized by an NDTM $M_{ND}$. We describe a deterministic
$r^2(n)$-space Turing machine $M_D$ recognizing L. For an input string w of length n,
this machine solves the REACHABILITY problem on the configuration graph $G(M_{ND}, w)$
of $M_{ND}$ described above. However, instead of placing on the input tape the entire configuration
graph, we place the input string w and the description of $M_{ND}$. We keep configurations
on the work tape as part of activation records (they describe vertices of $G(M_{ND}, w)$).
Each of the vertices (configurations) adjacent to a particular vertex can be deduced from the
description of $M_{ND}$.

Figure 8.6 The acyclic configuration graph $G(M_{ND}, w)$ of a nondeterministic Turing machine
$M_{ND}$ on input w has one vertex for each configuration of $M_{ND}$. Here heavy edges identify the
nondeterministic choices associated with a configuration.

Since the number of configurations of $M_{ND}$ is $N = O(k^{\log n + r(n)})$, each configuration
or activation record can be stored as a string of length O(r(n)).

From Theorem 8.5.5, the reachability in $G(M_{ND}, w)$ of the final configuration from
the initial one can be determined in space $O(\log^2 N)$. But $N = O(k^{\log n + r(n)})$ and
$r(n) = \Omega(\log n)$, so $\log^2 N = O(r^2(n))$, from which it follows that
NSPACE(r(n)) $\subseteq$ SPACE($r^2(n)$). ■

The classes NL, $\mathrm{L}^2$ and PSPACE are defined as unions of deterministic and nondeterministic
space-bounded complexity classes. Thus, it follows from this corollary that NL $\subseteq$ $\mathrm{L}^2$ $\subseteq$
PSPACE. However, because of the space hierarchy theorem (Theorem 8.5.2), it follows that
$\mathrm{L}^2$ is contained in but not equal to PSPACE, denoted $\mathrm{L}^2 \subset$ PSPACE.

8.5.4  Relations  Between  Time-  and  Space-Bounded  Classes 

In this section we establish a number of complexity class containment results involving both
space- and time-bounded classes. We begin by proving that the nondeterministic O(r(n))-space
class is contained within the deterministic $O(k^{r(n)})$-time class. This implies that NL $\subseteq$
P and NPSPACE $\subseteq$ EXPTIME.

THEOREM  8.5.6  The  classes  NSPACE(r(n))  and  TIME(r(n))  of  decision  problems  solvable  in 
nondeterministic  space  and  deterministic  time  r(n),  respectively,  satisfy  the  following  relation  for 
some  constant  k  >  0: 

$$\mathrm{NSPACE}(r(n)) \subseteq \mathrm{TIME}(k^{\log n + r(n)})$$

Proof Let $M_{ND}$ accept a language $L \in$ NSPACE(r(n)) and let $G(M_{ND}, w)$ be the
configuration graph for $M_{ND}$ on input w. To determine if w is accepted by $M_{ND}$ and
therefore in L, it suffices to determine if there is a path in $G(M_{ND}, w)$ from the initial
configuration of $M_{ND}$ to the final configuration. This is the REACHABILITY problem,
which, as stated in the proof of Theorem 8.5.5, can be solved by a DTM in time polynomial
in the length of the input. When this algorithm needs to determine the descendants of a
vertex in $G(M_{ND}, w)$, it consults the definition of $M_{ND}$ to determine the configurations
reachable from the current configuration. It follows that membership of w in L can be
determined in time $O(k^{\log n + r(n)})$ for some k > 1 or that L is in TIME($k^{\log n + r(n)}$). ■
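The point that the configuration graph need not be written down, because the neighbors of a configuration can be produced on demand from the machine description, can be illustrated in Python. The sketch below is a generic search over an implicit graph under stated assumptions: `accepts`, `successors`, and `is_accepting` are hypothetical names, configurations are assumed to be hashable values, and the toy "machine" at the end is invented purely for illustration.

```python
from collections import deque


def accepts(initial, successors, is_accepting):
    """Search the implicit configuration graph for an accepting configuration."""
    seen = {initial}
    queue = deque([initial])
    while queue:
        c = queue.popleft()
        if is_accepting(c):
            return True
        for nxt in successors(c):       # neighbors come from the machine's rules
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def toy_machine(target):
    """Toy nondeterministic counter: from c, move to c + 3 or c + 5."""
    def successors(c):
        return [x for x in (c + 3, c + 5) if x <= target]

    def is_accepting(c):
        return c == target

    return successors, is_accepting


succ, acc = toy_machine(11)
print(accepts(0, succ, acc))    # 11 = 3 + 3 + 5
```

Each configuration enters the queue at most once, so the running time is bounded by a polynomial in the number of configurations, which is the quantity counted in the proof.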

COROLLARY 8.5.2 NL $\subseteq$ P and NPSPACE $\subseteq$ EXPTIME

Later  we  explore  the  polynomial-time  problems  by  exhibiting  other  important  complexity 
classes that reside inside P. (See Section 8.15.) We now show containment of the nondeterministic
time complexity classes in deterministic space classes.

THEOREM  8.5.7  The  following  containment  holds: 

$$\mathrm{NTIME}(r(n)) \subseteq \mathrm{SPACE}(r(n))$$

Proof We use the construction of Theorem 5.2.2. Let L be a language in NTIME(r(n)).
We note that the choice string on the enumeration tape converts the nondeterministic recognition
of L into deterministic recognition. Since L is recognized in time r(n) for some
accepting computation, the deterministic enumeration runs in time r(n) for each choice
string. Thus, O(r(n)) cells are used on the work and enumeration tapes in this deterministic
simulation and L is in SPACE(r(n)). ■

Figure 8.7 The relationships among complexity classes derived in this section. Containment is
indicated by arrows.
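The deterministic simulation in the proof of Theorem 8.5.7 tries one choice string after another and reuses the same storage for each. A minimal Python sketch of that idea follows. The names are hypothetical: `run(w, choices)` is assumed to be a deterministic function that replays the nondeterministic machine on input w under a fixed binary choice string, and the subset-sum "machine" is a toy used only to make the example runnable.

```python
from itertools import product


def accepts_by_enumeration(run, w, r):
    """Accept w if some choice string of length r leads `run` to accept."""
    for choices in product((0, 1), repeat=r):   # generated one at a time
        if run(w, choices):
            return True
    return False


def subset_sum_run(w, choices):
    """Toy deterministic replay: choice bit i decides whether weight i is taken."""
    weights, target = w
    return sum(x for x, c in zip(weights, choices) if c) == target


w = ((3, 5, 7), 12)
print(accepts_by_enumeration(subset_sum_run, w, 3))   # 5 + 7 = 12
```

Only the current choice string is held at any moment, so the storage is proportional to r even though the number of choice strings examined is exponential in r.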

An immediate corollary to this theorem is that NP $\subseteq$ PSPACE. This implies that P $\subseteq$
EXPTIME. However, as mentioned above, P is strictly contained within EXPTIME.
Combining these results, we have the following complexity class inclusions:

$$\mathrm{L} \subseteq \mathrm{NL} \subseteq \mathrm{P} \subseteq \mathrm{NP} \subseteq \mathrm{PSPACE} \subseteq \mathrm{EXPTIME} \subseteq \mathrm{NEXPTIME}$$

where PSPACE = NPSPACE. We also have $\mathrm{L}^2 \subset$ PSPACE, and P $\subset$ EXPTIME, which
follow from the space and time hierarchy theorems. These inclusions and those derived below
are shown in Fig. 8.7.

In  Section  8.6  we  develop  refinements  of  this  partial  ordering  of  complexity  classes  by  using 
the  complements  of  complexity  classes. 

We  now  digress  slightly  to  discuss  space-bounded  functions. 

8.5.5 Space-Bounded Functions

We digress briefly to specialize Theorem 8.5.6 to log-space computations, not just log-space
language recognition. As the following demonstrates, log-space computable functions are
computable in polynomial time.

THEOREM 8.5.8 Let M be a DTM that halts on all inputs using space O(log n) to process inputs
of length n. Then M executes a polynomial number of steps.

Proof In the proof of Corollary 8.5.1 the number of configurations of a Turing machine M
with input and output tapes and s work tapes is counted. We repeat this analysis. Let r(n)
be the maximum number of tape cells used and let c be the maximal size of a tape alphabet.
Then, M can be in one of at most χ ≤ c^{(s+2)r(n)} (n r(n)^{s+1}) = O(k^{log n + r(n)})
configurations for some k > 1. Since M always halts, it never repeats a configuration (a
deterministic machine that did so would loop forever), so by the pigeonhole principle it
executes at most χ steps. Because r(n) = O(log n), χ = O(n^d) for some
integer d. Thus, M executes a polynomial number of steps. ■
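The last step of this proof can be written out. The following is a worked sketch, assuming the configuration count has the form c^{(s+2)r(n)} n r(n)^{s+1} and that r(n) ≤ a log_2 n for some constant a; the constants a and d are introduced here only for illustration.

```latex
\begin{align*}
\chi &\le c^{(s+2)r(n)} \, n \, r(n)^{s+1} \\
     &\le c^{(s+2)\,a \log_2 n} \, n \, (a \log_2 n)^{s+1} \\
     &= n^{a(s+2)\log_2 c} \, n \, (a \log_2 n)^{s+1} \\
     &= O(n^{d}) \quad \text{for any integer } d \ge a(s+2)\log_2 c + 2 .
\end{align*}
```

The third line uses c^{\log_2 n} = n^{\log_2 c}, and the last line uses the fact that a fixed power of log n grows more slowly than n.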

8.6  Complements  of  Complexity  Classes 

As  seen  in  Section  4.6,  the  regular  languages  are  closed  under  complementation.  However,  we 
have also seen in Section 4.13 that the context-free languages are not closed under
complementation. Thus, complementation is a way to develop an understanding of the properties of
a  class  of  languages.  In  this  section  we  show  that  the  nondeterministic  space  classes  are  closed 
under  complements.  The  complements  of  languages  and  decision  problems  were  defined  at 
the  beginning  of  this  chapter. 

Consider REACHABILITY. Its complement coREACHABILITY is the set of directed graphs
G = (V, E) and pairs of vertices u, v ∈ V such that there is no directed path from u
to v. It follows that the union of these two problems is not the entire set B* of strings
but the set of all instances consisting of a directed graph G = (V, E) and a pair of vertices
u, v ∈ V. This set is easily detected by a DTM. It must only verify that the string describing a
putative graph is in the correct format and that the representations for u and v are among the
vertices of this graph.

Given  a  complexity  class,  it  is  natural  to  define  the  complement  of  the  class. 

DEFINITION 8.6.1 The complement of a complexity class of decision problems C, denoted
coC, is the set of decision problems that are complements of decision problems in C.

Our  first  result  follows  from  the  definition  of  the  recognition  of  languages  by  DTMs. 

THEOREM 8.6.1 If C is a deterministic time or space complexity class, then coC = C.

Proof Every L ∈ C is recognized by a DTM M that halts within the resource bound
of C for every string, whether in L or in L̄, the complement of L. Create M̄ from M by
complementing the accept/reject status of states of M’s control unit. Then M̄ recognizes L̄
within the same resource bound. Thus, L̄, which by
definition is in coC, is also in C. That is, coC ⊆ C. Similarly, C ⊆ coC. Thus, coC = C. ■
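The construction in this proof amounts to flipping the answer of a machine that always halts. A minimal Python sketch follows, assuming the decider is modeled as a function that returns True (accept) or False (reject) on every input; the names `complement` and `even_number_of_ones` are illustrative and not from the text.

```python
def complement(decider):
    """Return a decider for the complement language.

    Assumes `decider` halts on every input and returns True or False;
    flipping the answer needs no extra time or space beyond the original."""
    def flipped(w):
        return not decider(w)
    return flipped

def even_number_of_ones(w):
    """A toy deterministic decider over binary strings."""
    return w.count("1") % 2 == 0

odd_number_of_ones = complement(even_number_of_ones)
```

The same trick does not carry over directly to nondeterministic machines, because rejecting on one choice string says nothing about the other choice strings.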

In  particular,  this  result  says  that  the  class  P  is  closed  under  complements.  That  is,  if  the 
“Yes” instances of a decision problem can be answered in deterministic polynomial time, then
so  can  the  “No”  instances. 

We  use  the  above  theorem  and  Theorem  5.7.6  to  give  another  proof  that  there  are  problems 
that  are  not  in  P. 

COROLLARY 8.6.1 There are languages not in P, that is, languages that cannot be recognized
deterministically in polynomial time.

Proof Since every language in P is recursive and L_1 defined in Section 5.7.2 is not recursive,
it follows that L_1 is not in P. ■

We now show that all nondeterministic space classes with a sufficiently large space bound
are also closed under complements. This leaves open the question whether the
nondeterministic time classes are closed under complement. As we shall see, this is intimately
related to the question P =? NP.

As  stated  in  Definition  5.2.1,  for  no  choices  of  moves  is  an  NDTM  allowed  to  produce  an 
answer  for  which  it  is  not  designed.  In  particular,  when  computing  a  function  it  is  not  allowed 
to  give  a  false  answer  for  any  set  of  nondeterministic  choices. 

THEOREM 8.6.2 (Immerman-Szelepcsényi) Given a graph G = (V, E) and a vertex v, the
number of vertices reachable from v can be computed by an NDTM in space O(log n), n = |V|.

Proof Let V = {1, 2, ..., n}. Any node reachable from a vertex v must be reachable via a
path of length (number of edges) of at most n − 1, n = |V|. Let R(k, u) be the number
of vertices of G reachable from u by paths of length k or less. The goal is to compute
R(n − 1, u). A deterministic program for this purpose could be based on the predicate
PATH(u, v, k) that has value 1 if there is a path of length k or less from vertex u to vertex
v and 0 otherwise and the predicate ADJACENT-OR-IDENTICAL(x, v) that has value 1 if
x = v or there is an edge in G from x to v and 0 otherwise. (See Fig. 8.8.) If we let the
vertices be associated with the integers in the interval [1, ..., n], then R(n − 1, u) can be
evaluated as follows:

R(n − 1, u) = Σ_{1≤v≤n} PATH(u, v, n − 1)

            = Σ_{1≤v≤n} ⋁_{1≤x≤n} PATH(u, x, n − 2) ∧ ADJACENT-OR-IDENTICAL(x, v)


When this description of R(n − 1, u) is converted to a program, the amount of storage
needed grows more rapidly than O(log n). However, if the inner use of PATH(u, x, n − 2)
is replaced by the nonrecursive and nondeterministic test EXISTS-PATH-FROM-u-TO-v-≤-LENGTH
of Fig. 8.9 for a path from u to x of length n − 2 or less, then the space can be kept to
O(log n). This test nondeterministically guesses paths but verifies deterministically that all
paths have been explored.

The procedure COUNTING-REACHABILITY of Fig. 8.9 is a nondeterministic program
computing R(n − 1, u). It uses the procedure #-VERTICES-AT-≤-DISTANCE-FROM-u
to compute the number of vertices at distance dist or less from u in order of increasing
values of dist. (It computes this number correctly or fails.) This procedure has prev_num_dist
as a parameter, which is the number of vertices at distance dist − 1 or less. It passes this
value to the procedure IS-NODE-AT-≤-DIST-FROM-u, which examines and counts all possible
next_to_last_nodes reachable from u. #-VERTICES-AT-≤-DISTANCE-FROM-u either
fails to find all possible vertices at distance dist − 1 or less, in which case it fails, or finds all
such vertices. Thus, it nondeterministically verifies that all possible paths from u have been
explored. IS-NODE-AT-≤-DIST-FROM-u uses the procedure EXISTS-PATH-FROM-u-TO-v-≤-LENGTH
that either correctly verifies that a path of length dist − 1 or less exists from u to
next_to_last_node or fails. In turn, EXISTS-PATH-FROM-u-TO-v-≤-LENGTH uses the
command NONDETERMINISTIC-GUESS([1, .., n]) to nondeterministically choose nodes
on a path from u to v.

Figure 8.8 Paths explored by the REACHABILITY algorithm. Case (a) applies when x and v are
different and (b) when they are the same.

COUNTING-REACHABILITY(u)
    {R(k, u) = number of vertices at distance ≤ k from u in G = (V, E)}
    prev_num_dist := 1    {prev_num_dist = R(0, u)}
    for dist := 1 to n − 1
        num_dist := #-VERTICES-AT-≤-DISTANCE-FROM-u(dist, u, prev_num_dist)
        prev_num_dist := num_dist
        {num_dist = R(dist, u)}
    return(num_dist)

#-VERTICES-AT-≤-DISTANCE-FROM-u(dist, u, prev_num_dist)
    {Returns R(dist, u) given prev_num_dist = R(dist − 1, u) or fails}
    num_nodes := 0
    for last_node := 1 to n
        if IS-NODE-AT-≤-DIST-FROM-u(dist, u, last_node, prev_num_dist) then
            num_nodes := num_nodes + 1
    return(num_nodes)

IS-NODE-AT-≤-DIST-FROM-u(dist, u, last_node, prev_num_dist)
    {num_node = number of vertices at distance ≤ dist − 1 from u found so far}
    num_node := 0
    reply := false
    for next_to_last_node := 1 to n
        if EXISTS-PATH-FROM-u-TO-v-≤-LENGTH(u, next_to_last_node, dist − 1) then
            num_node := num_node + 1    {count number of next-to-last nodes or fail}
            if ADJACENT-OR-IDENTICAL(next_to_last_node, last_node) then
                reply := true
    if num_node < prev_num_dist then
        fail
    else return(reply)

EXISTS-PATH-FROM-u-TO-v-≤-LENGTH(u, v, dist)
    {nondeterministically choose at most dist vertices, fail if they don’t form a path}
    node_1 := u
    for count := 1 to dist
        node_2 := NONDETERMINISTIC-GUESS([1, .., n])
        if not ADJACENT-OR-IDENTICAL(node_1, node_2) then
            fail
        else node_1 := node_2
    {node_1 is the last vertex on the guessed path; this test also covers dist = 0}
    if node_1 = v then
        return(true)
    else
        return(false)

Figure 8.9 A nondeterministic program counting vertices reachable from u. Comments are
enclosed in braces {, }.

Since this program is not recursive, it uses a fixed number of variables. Because these
variables assume values in the range [1, 2, 3, ..., n], it follows that space O(log n) suffices
to implement it on an NDTM. ■
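The structure of the counting program of Fig. 8.9 can be exercised on small graphs with the following Python sketch. It is not the log-space algorithm itself: the nondeterministic guess is replaced here by an exhaustive search over all vertex sequences (an assumption made so that the code is deterministic and runnable), which takes exponential time. The graph encoding (a dict from each vertex to its set of successors) and the function names are also assumptions of this sketch.

```python
from itertools import product

def adjacent_or_identical(adj, x, v):
    """True if x = v or (x, v) is an edge, as in ADJACENT-OR-IDENTICAL."""
    return x == v or v in adj[x]

def exists_path(adj, u, v, dist):
    """Stand-in for the nondeterministic test: instead of guessing one
    sequence of `dist` vertices, try every sequence (exponential time)."""
    for guess in product(sorted(adj), repeat=dist):
        node, ok = u, True
        for nxt in guess:
            if not adjacent_or_identical(adj, node, nxt):
                ok = False
                break
            node = nxt
        if ok and node == v:
            return True
    return False

def is_node_at_dist(adj, dist, u, last_node, prev_num_dist):
    """True if last_node is within `dist` of u. Re-counts the vertices
    within dist - 1 to confirm that none of them was missed."""
    num_node, reply = 0, False
    for next_to_last in sorted(adj):
        if exists_path(adj, u, next_to_last, dist - 1):
            num_node += 1
            if adjacent_or_identical(adj, next_to_last, last_node):
                reply = True
    if num_node < prev_num_dist:
        raise RuntimeError("fail: a vertex within dist - 1 was missed")
    return reply

def counting_reachability(adj, u):
    """Return R(n - 1, u), the number of vertices reachable from u."""
    prev_num_dist = 1                      # R(0, u): only u itself
    for dist in range(1, len(adj)):
        prev_num_dist = sum(
            is_node_at_dist(adj, dist, u, last, prev_num_dist)
            for last in sorted(adj)
        )
    return prev_num_dist
```

Because the search here is exhaustive, the `fail` branch is never taken; in the nondeterministic program it is this count check that rules out computations whose guesses missed a reachable vertex.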

We  now  extend  this  result  to  nondeterministic  space  computations. 

COROLLARY 8.6.2 If r(n) = Ω(log n) is proper, NSPACE(r(n)) = coNSPACE(r(n)).

Proof Let L ∈ NSPACE(r(n)) be decided by an r(n)-space bounded NDTM M. We
show that the complement of L can be decided by a nondeterministic r(n)-space bounded
Turing machine M̄, stopping on all inputs. We modify slightly the program of Fig. 8.9 for
this purpose. The graph G is the configuration graph of M. Its initial vertex is the
configuration determined by the string w that is initially written on M’s input tape. To determine
adjacency between two vertices in the configuration graph, computations of M are simulated
on one of M̄’s work tapes.

M̄ computes a slightly modified version of COUNTING-REACHABILITY. First, if the
procedure IS-NODE-AT-≤-DIST-FROM-u returns true for a vertex that is a
halting accepting configuration of M, then M̄ halts and rejects the string. If the procedure
COUNTING-REACHABILITY completes successfully without rejecting the string, then M̄
halts and accepts the input string because every possible computation of M on the input
string has been examined and none of them is accepting. This computation is
nondeterministic.

The space used by M̄ is the space needed for COUNTING-REACHABILITY, which
is O(log N), where N is the number of vertices in the configuration graph of
M, plus the space for a simulation of M, which is O(r(n)). Since N = O(k^{log n + r(n)})
(see the proof of Theorem 8.5.6), the total space for this computation is O(log n + r(n)),
which is O(r(n)) if r(n) = Ω(log n). By definition L̄ ∈ coNSPACE(r(n)). From the
above construction L̄ ∈ NSPACE(r(n)). Thus, coNSPACE(r(n)) ⊆ NSPACE(r(n)).

Conversely, since L̄ ∈ NSPACE(r(n)), its complement L is by definition in coNSPACE(r(n)),
which implies that NSPACE(r(n)) ⊆ coNSPACE(r(n)); that is, the two classes are equal. ■

The lowest nondeterministic class in the space hierarchy that is known to be closed under
complements is the class NL; that is, NL = coNL. This result is used in Section 8.11 to show
that the problem 2-SAT, a specialization of the NP-complete problem 3-SAT, is in P.

From Theorem 8.6.1 we know that all deterministic time and space complexity classes are
closed under complements. From Corollary 8.6.2 we also know that all nondeterministic space
complexity classes with space Ω(log n) are closed under complements. However, we do not
yet know whether the nondeterministic time complexity classes are closed under complements.
This important question is related to the question whether P = NP, because if NP ≠ coNP,
then P ≠ NP because P is closed under complements but NP is not.

8.6.1  The  Complement  of  NP 

The  class  coNP  is  the  class  of  decision  problems  whose  complements  are  in  NP.  That  is, 
coNP  is  the  language  of  “No”  instances  of  problems  in  NP.  The  decision  problem  VALIDITY 
defined  below  is  an  example  of  a  problem  in  coNP.  In  fact,  it  is  log-space  complete  for  coNP. 
(See Problem 8.10.) VALIDITY identifies SOPEs (the sum-of-products expansion, defined in
Section 2.3) that have value 1 for every assignment of values to their variables.

VALIDITY 

Instance: A set of literals X = {x_1, x̄_1, x_2, x̄_2, ..., x_n, x̄_n}, and a sequence of products
P = (p_1, p_2, ..., p_m), where each product p_i is a subset of X.
Answer: “Yes” if for all assignments of Boolean values to variables in {x_1, x_2, ..., x_n} every
literal in at least one product has value 1.

Given a language L in NP, a string in L has a certificate for its membership in L consisting
of the set of choices that cause its recognizing Turing machine to accept it. For example, a
certificate for SATISFIABILITY is a set of values for its variables satisfying at least one literal
in each sum. For an instance of a problem in coNP, a disqualification is a certificate for the
complement of the instance. A “Yes” instance of coVALIDITY is certified by an assignment that
causes all products to have value 0. Thus, each “No” instance of VALIDITY is disqualified by
such an assignment, which prevents the expression from being valid. (See Problem 8.11.)
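A disqualification for VALIDITY is easy to check but may be hard to find. The Python sketch below searches for one by brute force over all 2^n assignments. The encoding (each product is a list of nonzero integers, +i for x_i and -i for its complement) and the function names are assumptions made for illustration.

```python
from itertools import product

def product_value(term, assignment):
    """A product has value 1 when every one of its literals has value 1.
    Literals are nonzero ints: +i for x_i, -i for its complement."""
    return all((lit > 0) == assignment[abs(lit) - 1] for lit in term)

def find_disqualification(products, num_vars):
    """Return an assignment giving every product value 0, or None if the
    expansion is valid. Brute force over all 2^n assignments."""
    for assignment in product((False, True), repeat=num_vars):
        if not any(product_value(t, assignment) for t in products):
            return assignment
    return None

def is_valid(products, num_vars):
    """True exactly when no disqualifying assignment exists."""
    return find_disqualification(products, num_vars) is None
```

Checking a given disqualification takes one pass over the products, which is why the complement of VALIDITY is in NP.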

As mentioned just before the start of this section, if NP ≠ coNP, then P ≠ NP because P
is closed under complements. Because we know of no way to establish NP ≠ coNP, we try to
identify a problem that is in NP but is not known to be in P. A problem that is in NP and coNP
simultaneously (the class NP ∩ coNP) is a possible candidate for a problem that is in NP but
not P, which would show that P ≠ NP. We show that PRIMALITY is in NP ∩ coNP. (It is
straightforward to show that P ⊆ NP ∩ coNP. See Problem 8.12.)

PRIMALITY 

Instance:  An  integer  n  written  in  binary  notation. 

Answer:  “Yes”  if  n  is  a  prime. 

A disqualification for PRIMALITY is an integer other than 1 and n that is a factor of n. Thus,
the complement of PRIMALITY is in NP, so PRIMALITY is in coNP. We now show that PRIMALITY
is also in NP, that is, that it is in NP ∩ coNP. To prove the desired result we need the
following result from number theory, which we do not prove (see [235, p. 222] for a proof).

THEOREM 8.6.3 An integer p > 2 is prime if and only if there is an integer 1 < r < p such that
r^{p−1} = 1 mod p and for all prime divisors q of p − 1, r^{(p−1)/q} ≠ 1 mod p.
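The two conditions of Theorem 8.6.3 can be checked directly with modular exponentiation. The following Python sketch uses the built-in three-argument `pow`; the function name `is_lucas_witness` is an assumption of this example, and the caller is assumed to supply the distinct prime divisors of p − 1.

```python
def is_lucas_witness(p, r, prime_divisors):
    """Check the two conditions of the theorem for a candidate r.

    `prime_divisors` is assumed to list the distinct prime divisors of
    p - 1; this function does not verify that assumption."""
    if not 1 < r < p:
        return False
    if pow(r, p - 1, p) != 1:
        return False
    return all(pow(r, (p - 1) // q, p) != 1 for q in prime_divisors)
```

For p = 7 the value r = 3 satisfies both conditions, while r = 2 does not because 2^3 = 1 mod 7. For the composite 9 no value of r works.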

As a consequence, to give evidence of primality of an integer p > 2, we need only provide
an integer r, 1 < r < p, and the prime divisors {q_1, ..., q_k} of p − 1 and then
show that r^{p−1} = 1 mod p and r^{(p−1)/q} ≠ 1 mod p for q ∈ {q_1, ..., q_k}. By the theorem,
such integers exist if and only if p is prime. In turn, we must give evidence that the integers
{q_1, ..., q_k} are prime divisors of p − 1, which requires showing that they divide p − 1 and
are prime. We must also show that k is small and that the recursive check of the primes does
not grow exponentially. Evidence of the primality of the divisors can be given in the same way,
that is, by exhibiting an integer r_j for each prime as well as the prime divisors of q_j − 1 for
each prime q_j. We must then show that all of this evidence can be given succinctly and verified
deterministically in time polynomial in the length n of p.

THEOREM 8.6.4 PRIMALITY is in NP ∩ coNP.

Proof We give an inductive proof that PRIMALITY is in NP. For a prime p we give its
evidence E(p) as (p; r, E(q_1), ..., E(q_k)), where E(q_j) is evidence for the prime q_j. We
let the evidence for the base case p = 2 be E(2) = (2). Then, E(3) = (3; 2, (2)) because
r = 2 works for this case and 2 is the only prime divisor of 3 − 1, and (2) is the evidence for
it. Also, E(5) = (5; 3, (2)). The length |E(p)| of the evidence E(p) on p is the number
of parentheses, commas and bits in integers forming part of the evidence.
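The recursive structure of the evidence E(p) can be checked mechanically. The Python sketch below is one possible verifier under an assumed encoding that is not from the text: E(2) is the tuple `(2,)` and E(p) is the tuple `(p, r, [E(q_1), ..., E(q_k)])`. It confirms that the certified primes account for every prime factor of p − 1 and that r satisfies the conditions of Theorem 8.6.3.

```python
def verify_evidence(evidence):
    """Return the prime p certified by `evidence`, or None if it is invalid.

    Assumed encoding (for illustration only): E(2) is (2,), and
    E(p) is (p, r, [E(q_1), ..., E(q_k)]) with distinct primes q_j."""
    if evidence == (2,):
        return 2
    if len(evidence) != 3:
        return None
    p, r, sub = evidence
    primes = [verify_evidence(e) for e in sub]
    if p < 3 or not 1 < r < p or not primes or None in primes:
        return None
    # The certified primes must account for every prime factor of p - 1.
    rest = p - 1
    for q in primes:
        if rest % q != 0:
            return None
        while rest % q == 0:
            rest //= q
    if rest != 1 or pow(r, p - 1, p) != 1:
        return None
    if any(pow(r, (p - 1) // q, p) == 1 for q in primes):
        return None
    return p
```

With this encoding the examples in the proof become `(3, 2, [(2,)])` and `(5, 3, [(2,)])`, and evidence for 7 can be built from the evidence for 2 and 3.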

We show by induction that |E(p)| is at most 4(log_2 p)^2. The base case satisfies the
hypothesis because |E(2)| = 4.

Because the prime divisors {q_1, ..., q_k} satisfy q_i ≥ 2 and q_1 q_2 ··· q_k ≤ p − 1, it follows
that k ≤ ⌊log_2 p⌋ ≤ n. Also, since p > 2 is prime, it is odd and p − 1 is divisible by 2. Thus,
the first prime divisor of p − 1 is 2.

Let E(p) = (p; r, E(2), E(q_2), ..., E(q_k)). Let the inductive hypothesis be that
|E(p)| ≤ 4(log_2 p)^2. Let n_j = log_2 q_j. From the definition of E(p) we have that |E(p)|
satisfies the following inequality because at most n bits are needed for each of p and r, there
are k − 1 ≤ n − 1 commas and three other punctuation marks, and |E(2)| = 4.

|E(p)| ≤ 3n + 6 + 4 Σ_{2≤j≤k} n_j^2

Since the q_j are the prime divisors of p − 1 and some primes may be repeated in p − 1,
their product (which includes q_1 = 2) is at most p − 1. It follows that Σ_{2≤j≤k} n_j =
log_2 Π_{2≤j≤k} q_j ≤ log_2((p − 1)/2). Since the sum of the squares of n_j is less than or equal
to the square of the sum of n_j, it follows that the sum in the above expression is at most
(log_2 p − 1)^2 ≤ (n − 1)^2. But 3n + 6 + 4(n − 1)^2 = 4n^2 − 5n + 10 ≤ 4n^2 when n ≥ 2.
Thus, the description of a certificate for the primality of p is polynomial in the length n of p.

We now show by induction that a prime p can be verified in O(n^4) steps on a RAM.
Assume that the divisors q_1, ..., q_k for p − 1 have been verified. To verify p, we compute
r^{p−1} mod p from r and p as well as r^{(p−1)/q} mod p for each of the prime divisors q of
p − 1 and compare the results with 1. The integers (p − 1)/q can be computed through
subtraction of n-bit numbers in O(n^2) steps on a RAM. To raise r to an exponent e,
represent e as a binary number. For example, if e = 7, write it as e = 2^2 + 2^1 + 2^0. If t
is the largest such power of 2, t ≤ log_2(p − 1) < n. Compute r^{2^j} mod p by squaring
r j times, each time reducing it by p through division. Since each squaring/reduction step
takes O(n^2) RAM steps, at most O(jn^2) RAM steps are required to compute r^{2^j}. Since
this may be done for 2 ≤ j ≤ t and Σ_{2≤j≤t} j = O(t^2), at most O(n^3) RAM steps suffice
to compute one of r^{p−1} mod p or r^{(p−1)/q} mod p for a prime divisor q. Since there are at
most n of these quantities to compute, O(n^4) RAM steps suffice to compute them.
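The exponentiation step described above can be sketched in Python as follows. The function name `power_mod` is an assumption of this example; it keeps a running square r^(2^j) mod p and multiplies it into the result whenever the corresponding bit of e is 1, which is one common way to organize repeated squaring.

```python
def power_mod(r, e, p):
    """Compute r**e mod p by repeated squaring, reducing mod p each time."""
    result = 1
    square = r % p          # holds r^(2^j) mod p for j = 0, 1, 2, ...
    while e > 0:
        if e & 1:           # this power of 2 appears in the binary form of e
            result = (result * square) % p
        square = (square * square) % p
        e >>= 1
    return result
```

Every intermediate value stays below p^2, so each multiplication involves numbers of at most 2n bits when p has n bits.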

To complete the verification of the prime p, we also need to verify the divisors q_1, ..., q_k
of p − 1. We take as our inductive hypothesis that an arbitrary prime q of n bits can be
verified in O(n^5) steps. Since the sum of the number of bits in q_2, ..., q_k is at most
log_2((p − 1)/2) and the sum of the kth powers is no more than the kth power of the sum, it
follows that O(n^5) RAM steps suffice to verify p. Since a polynomial number of RAM steps can be
executed in a polynomial number of Turing machine steps, PRIMALITY is in NP. ■

Since NP ∩ coNP ⊆ NP and NP ∩ coNP ⊆ coNP as well as NP ⊆ NP ∪ coNP and
coNP ⊆ NP ∪ coNP, we begin to have the makings of a hierarchy. If we add that coNP
⊆ PSPACE (see Problem 8.13), we have the relationships between complexity classes shown
schematically in Fig. 8.7.


8.7  Reductions 

In  this  section  we  specialize  the  reductions  introduced  in  Section  2.4  and  use  them  to  classify 
problems  into  categories.  We  show  that  if  problem  A  is  reduced  to  problem  B  by  a  function 
in  the  set  R  and  A  is  hard  relative  to  R,  then  B  cannot  be  easy  relative  to  R  because  A  can 
be  solved  easily  by  reducing  it  to  B  and  solving  B  with  an  easy  algorithm,  contradicting  the 
fact  that  A  is  hard.  On  the  other  hand,  if  B  is  easy  to  solve  relative  to  R,  then  A  must  be 
easy  to  solve.  Thus,  reductions  can  be  used  to  show  that  some  problems  are  hard  or  easy.  Also, 
if  A  can  be  reduced  to  B  by  a  function  in  R  and  vice  versa,  then  A  and  B  have  the  same 
complexity  relative  to  R. 

Reductions  are  widely  used  in  computer  science;  we  use  them  whenever  we  specialize  one 
procedure  to  realize  another.  Thus,  reductions  in  the  form  of  simulations  are  used  throughout 
Chapter 3 to exhibit circuits that compute the same functions that are computed by
finite-state, random-access, and Turing machines, with and without nondeterminism. Simulations
prove  to  be  an  important  type  of  reduction.  Similarly,  in  Chapter  10  we  use  simulation  to  show 
that  any  computation  done  in  the  pebble  game  can  be  simulated  by  a  branching  program. 

Not only did we simulate machines with memory by circuits in Chapter 3, but we
demonstrated in Sections 3.9.5 and 3.9.6 that the languages CIRCUIT VALUE and CIRCUIT SAT
describing  circuits  are  P-complete  and  NP-complete,  respectively.  We  demonstrated  that  each 
string  x  in  an  arbitrary  language  in  P  (NP)  could  be  translated  into  a  string  in  CIRCUIT  VALUE 
(respectively,  CIRCUIT  SAT)  by  a  program  whose  running  time  is  polynomial  in  the  length  of 
x  and  whose  space  is  logarithmic  in  its  length. 

In  this  chapter  we  extend  these  results.  We  consider  primarily  transformations  (also  called 
many-one  reductions  and  just  reductions  in  Section  5.8.1),  a  type  of  reduction  in  which  an 
instance  of  one  decision  problem  is  translated  to  an  instance  of  a  second  problem  such  that  the 
former is a “Yes” instance if and only if the latter is a “Yes” instance. A Turing reduction is a
second  type  of  reduction  that  is  defined  by  an  oracle  Turing  machine.  (See  Section  8.4.2  and 
Problem  8.15.)  In  this  case  the  Turing  machine  may  make  more  than  one  call  to  the  second 
problem  (the  oracle).  A  transformation  is  equivalent  to  an  oracle  Turing  reduction  that  makes 
one  call  to  the  oracle.  Turing  reductions  subsume  all  previous  reductions  used  elsewhere  in  this 
book.  (See  Problems  8.15  and  8.16.)  However,  since  the  results  of  this  section  can  be  derived 
with  the  weaker  transformations,  we  limit  our  attention  to  them. 

DEFINITION 8.7.1 If L_1 and L_2 are languages, a transformation h from L_1 to L_2 is a
DTM-computable function h : B* ↦ B* such that x ∈ L_1 if and only if h(x) ∈ L_2. A
resource-bounded transformation is a transformation that is computed under a resource bound
such as deterministic logarithmic space or polynomial time.

The classification of problems is simplified by considering classes of transformations. These
classes will be determined by bounds on resources such as space and time on a Turing machine
or circuit size and depth.
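As an illustration of Definition 8.7.1, consider mapping a product-of-sums expression to a sum-of-products expression by complementing every literal (De Morgan's rules). The first expression is unsatisfiable exactly when the second is valid, so the map h below is a transformation from the complement of SATISFIABILITY to VALIDITY. The encoding (lists of signed integers) and the brute-force membership tests are assumptions of this sketch, included only so that the defining property of a transformation can be checked on small instances.

```python
from itertools import product

def h(sums):
    """Complement every literal; each sum is then reread as a product."""
    return [[-lit for lit in s] for s in sums]

def literal(lit, a):
    """Value of a literal (+i for x_i, -i for its complement) under a."""
    return (lit > 0) == a[abs(lit) - 1]

def unsatisfiable(sums, n):
    """Brute-force membership test for the source language."""
    return not any(
        all(any(literal(l, a) for l in s) for s in sums)
        for a in product((False, True), repeat=n)
    )

def valid(products, n):
    """Brute-force membership test for the target language."""
    return all(
        any(all(literal(l, a) for l in t) for t in products)
        for a in product((False, True), repeat=n)
    )
```

The transformation h itself only copies its input with signs flipped, so it needs little more than a counter for its position in the input; the exponential cost lies entirely in the membership tests used here for checking.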

DEFINITION 8.7.2 For decision problems 𝒫_1 and 𝒫_2, the notation 𝒫_1 ≤_R 𝒫_2 means that 𝒫_1 can
be transformed to 𝒫_2 by a transformation in the class R.

Compatibility  among  transformation  classes  and  complexity  classes  helps  determine  con¬ 
ditions  under  which  problems  are  hard. 

DEFINITION 8.7.3 Let C be a complexity class, R a class of resource-bounded transformations, and
𝒫_1 and 𝒫_2 decision problems. A set of transformations R is compatible with C if 𝒫_1 ≤_R 𝒫_2
and 𝒫_2 ∈ C imply that 𝒫_1 ∈ C.

It is easy to see that the polynomial-time transformations (denoted ≤_p) are compatible
with P. (See Problem 8.17.) Also compatible with P are the log-space transformations
(denoted ≤_log-space) associated with transformations that can be computed in logarithmic space.
Log-space transformations are also polynomial-time transformations, as shown in Theorem 8.5.8.

8.8  Hard  and  Complete  Problems 

Classes  of  problems  are  defined  above  by  their  use  of  space  and  time.  We  now  set  the  stage  for 
the  identification  of  problems  that  are  hard  relative  to  members  of  these  classes.  A  few  more 
definitions  are  needed  before  we  begin  this  task. 

DEFINITION 8.8.1 A class R of transformations is transitive if the composition of any two
transformations in R is also in R and for all problems 𝒫_1, 𝒫_2, and 𝒫_3, 𝒫_1 ≤_R 𝒫_2 and
𝒫_2 ≤_R 𝒫_3 implies that 𝒫_1 ≤_R 𝒫_3.

If  a  class  R  of  transformations  is  transitive,  then  we  can  compose  any  two  transformations 
in  the  class  and  obtain  another  transformation  in  the  class.  Transitivity  is  used  to  define  hard 
and  complete  problems. 

The transformations ≤_p and ≤_log-space described above are transitive. Below we show
that ≤_log-space is transitive and leave to the reader the proof of transitivity of ≤_p and the
polynomial-time Turing reductions. (See Problem 8.19.)

THEOREM 8.8.1 Log-space transformations are transitive.

Proof  A  log-space  transformation  is  a  DTM  that  has  a  read-only  input  tape,  a  write-only 
output tape, and a work tape or tapes on which it uses O(log n) cells to process an input
string  w  of  length  n.  As  shown  in  Theorem  8.5.8,  such  DTMs  halt  within  polynomial  time. 
We  now  design  a  machine  T  that  composes  two  log-space  transformations  in  logarithmic 
space.  (See  Fig.  8.10.) 

Let M_1 and M_2 denote the first and second log-space DTMs. When M_1 and M_2 are
composed to form T, the output tape of M_1, which is also the input tape of M_2, becomes
a work tape of T. Since M_1 may execute a polynomial number of steps, we cannot store all
its output before beginning the computation by M_2. Instead we must be more clever. We
keep the contents of the work tapes of both machines as well as (and this is where we are
clever) an integer h_1 recording the position of the input head of M_2 on the output tape of
M_1. If M_2 moves its input head right by one step, we increment h_1 and simulate M_1 until
one more output is produced. If its head moves left, we decrement h_1, restart M_1, and
simulate it until h_1 outputs are produced and then supply this output as an input to M_2.

The space used by this simulation is the space used by M_1 and M_2 plus the space for
h_1, the value under the input head of M_2 and some temporary space. The total space is
logarithmic in n since h_1 is at most a polynomial in n and so can be stored in O(log n) bits. ■

Figure 8.10 The composition of two deterministic log-space Turing machines.
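The recomputation idea in the proof of Theorem 8.8.1 can be sketched in Python. The two toy transducers `m1` and `m2` and the helper `compose` are assumptions of this example, not machines from the text. The second transducer sees its input only through `read(h1)`, and `compose` answers each such request by restarting the first transducer and discarding all but the requested output symbol. For simplicity this sketch restarts the first machine on every request, including moves to the right.

```python
def m1(w):
    """First transducer (toy example): emit every input symbol twice."""
    for ch in w:
        yield ch
        yield ch

def m2(read):
    """Second transducer (toy example): emit its input in reverse.
    It sees its input only through read(i), which returns symbol i or None."""
    end = 0
    while read(end) is not None:      # move the input head to the right end
        end += 1
    for i in range(end - 1, -1, -1):  # then sweep left
        yield read(i)

def compose(first, second, w):
    """Run `second` on the output of `first` without storing that output."""
    def read(h1):
        # Restart the first machine and discard all but output number h1.
        for i, sym in enumerate(first(w)):
            if i == h1:
                return sym
        return None
    return "".join(second(read))
```

Only the head position h1 and the state of one run of each transducer are kept between requests, at the price of recomputing the first transducer's output many times.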

We  now  apply  transitivity  of  reductions  to  define  hard  and  complete  problems. 

DEFINITION 8.8.2 Let R be a class of reductions, let C be a complexity class, and let R be
compatible with C. A problem Q is hard for C under R-reductions if for every problem 𝒫 ∈ C,
𝒫 ≤_R Q. A problem Q is complete for C under R-reductions if it is hard for C under
R-reductions and is a member of C.

Problems are hard for a class if they are at least as hard to solve as any problem in the class.
Sometimes problems are shown hard for a class without showing that they are members of that
class. Complete problems are members of the class for which they are hard. Thus, complete
problems are the hardest problems in the class. We now define three important classes of
complete problems.

DEFINITION 8.8.3 Problems in P that are hard for P under log-space reductions are called
P-complete. Problems in NP that are hard for NP under polynomial-time reductions are called
NP-complete. Problems in PSPACE that are hard for PSPACE under polynomial-time reductions are
called PSPACE-complete.

We  state  Theorem  8.8.2,  which  follows  directly  from  Definition  8.7.3  and  transitivity  of 
log-space  and  polynomial-time  reductions,  because  it  incorporates  as  conditions  the  goals  of 
the  study  of  P-complete,  NP-complete,  and  PSPACE-complete  problems,  namely,  to  show 
that  all  problems  in  P  can  be  solved  in  log-space  and  all  problems  in  NP  and  PSPACE  can  be 
solved  in  polynomial  time.  It  is  unlikely  that  any  of  these  goals  can  be  reached. 

THEOREM 8.8.2 If a P-complete problem can be solved in log-space, then all problems in P can
be solved in log-space. If an NP-complete problem is in P, then P = NP. If a PSPACE-complete
problem is in P, then P = PSPACE.

In Theorem 8.14.2 we show that if a P-complete problem can be solved in poly-logarithmic
time with polynomially many processors on a CREW PRAM (that is, if it is fully parallelizable),
then so can all problems in P. It is considered unlikely that all languages in P can be fully
parallelized. Nonetheless, the question of the parallelizability of P is reduced to deciding whether
P-complete problems are parallelizable.

8.9  P-Complete  Problems 

To  show  that  a  problem  V  is  P-complete  we  must  show  that  it  is  in  P  and  that  all  problems 
in  P  can  be  reduced  to  V  via  a  log-space  reduction.  (See  Section  3.9.5.)  The  task  of  showing 
this  is  simplified  by  the  knowledge  that  log-space  reductions  are  transitive:  if  another  problem 
Q  has  already  been  shown  to  be  P-complete,  to  show  that  V  is  P-complete  it  suffices  to  show 
there is a log-space reduction from Q to V and that V ∈ P.

CIRCUIT  VALUE 

Instance: A circuit description with fixed values for its input variables and a designated
output gate.

Answer: “Yes” if the output of the circuit has value 1.

In  Section  3.9.5  we  show  that  the  CIRCUIT  VALUE  problem  described  above  is  P-complete 
by  demonstrating  that  for  every  decision  problem  V  in  P  an  instance  w  of  V  and  a  DTM  M 
that  recognizes  “Yes”  instances  of  V  can  be  translated  by  a  log-space  DTM  into  an  instance  c 
of  CIRCUIT  VALUE  such  that  w  is  a  “Yes”  instance  of  V  if  and  only  if  c  is  a  “Yes”  instance  of 
CIRCUIT  VALUE. 

Since P is closed under complements (see Theorem 8.6.1), it follows that if the “Yes” instances
of a decision problem can be determined in polynomial time, so can the “No” instances.
Thus,  the  CIRCUIT  VALUE  problem  is  equivalent  to  determining  the  value  of  a  circuit  from  its 
description.  Note  that  for  CIRCUIT  VALUE  the  values  of  all  variables  of  a  circuit  are  included 
in  its  description. 

CIRCUIT  VALUE  is  in  P  because,  as  shown  in  Theorem  8.13.2,  a  circuit  can  be  evaluated 
in  a  number  of  steps  proportional  at  worst  to  the  square  of  the  length  of  its  description.  Thus, 
an  instance  of  CIRCUIT  VALUE  can  be  evaluated  in  a  polynomial  number  of  steps. 

Monotone circuits are constructed of AND and OR gates. The functions computed by
monotone circuits form an asymptotically small subset of the set of Boolean functions. Also,
many important Boolean functions are not monotone, such as binary addition. But even
though monotone circuits are a very restricted class of circuits, the monotone version of
CIRCUIT VALUE, defined below, is also P-complete.

MONOTONE  CIRCUIT  VALUE 

Instance: A description for a monotone circuit with fixed values for its input variables and
a designated output gate.

Answer: “Yes” if the output of the circuit has value 1.

CIRCUIT  VALUE  is  a  starting  point  to  show  that  many  other  problems  are  P-complete.  We 
begin  by  reducing  it  to  MONOTONE  CIRCUIT  VALUE. 

THEOREM 8.9.1 MONOTONE CIRCUIT VALUE is P-complete.

Proof  As  shown  in  Problem  2.12,  every  Boolean  function  can  be  realized  with  just  AND 
and  OR  gates  (this  is  known  as  dual-rail  logic)  if  the  values  of  input  variables  and  their 
complements  are  made  available.  We  reduce  an  instance  of  CIRCUIT  VALUE  to  an  instance 
of  MONOTONE  CIRCUIT  VALUE  by  replacing  each  gate  with  the  pair  of  monotone  gates 
described  in  Problem  2.12.  Such  descriptions  can  be  written  out  in  log-space  if  the  gates  in 
the  monotone  circuit  are  numbered  properly.  (See  Problem  8.20.)  The  reduction  must  also 
write  out  the  values  of  variables  of  the  original  circuit  and  their  complements.  ■ 

The  class  of  P-complete  problems  is  very  rich.  Space  limitations  require  us  to  limit  our 
treatment  of  this  subject  to  two  more  problems.  We  now  show  that  LINEAR  INEQUALITIES 
described below is P-complete. LINEAR INEQUALITIES is important because it is directly
related to LINEAR PROGRAMMING, which is widely used to characterize optimization problems.
The  reader  is  asked  to  show  that  LINEAR  PROGRAMMING  is  P-complete.  (See  Problem  8.21.) 

LINEAR  INEQUALITIES 

Instance: An integer-valued m × n matrix A and column m-vector b.

Answer: “Yes” if there is a rational column n-vector x > 0 (all components are non-negative
and at least one is non-zero) such that Ax ≤ b.

We  show  that  LINEAR  INEQUALITIES  is  P-hard,  that  is,  that  every  problem  in  P  can  be 
reduced  to  it  in  log-space.  The  proof  that  LINEAR  INEQUALITIES  is  in  P,  an  important  and 
difficult  result  in  its  own  right,  is  not  given  here.  (See  [165].) 

THEOREM 8.9.2 LINEAR INEQUALITIES is P-hard.

Proof  We  give  a  log-space  reduction  of  CIRCUIT  VALUE  to  LINEAR  INEQUALITIES.  That 
is, we show that in log-space an instance of CIRCUIT VALUE can be transformed to an instance
of LINEAR INEQUALITIES so that an instance of CIRCUIT VALUE is a “Yes” instance
if  and  only  if  the  corresponding  instance  of  LINEAR  INEQUALITIES  is  a  “Yes”  instance. 

The log-space reduction that we use converts each gate and input in an instance of a
circuit into a set of inequalities. The inequalities describing each gate are shown below. (An
equality relation a = b is equivalent to two inequality relations, a ≤ b and b ≤ a.) The
reduction also writes the equality z = 1 for the output gate z. Since each variable must
be non-negative, this last condition ensures that the resulting vector of variables, x, satisfies
x > 0.

           Type     Function      Inequalities

Input      TRUE     x_i = 1       x_i = 1
           FALSE    x_i = 0       x_i = 0

Gates      NOT      w = ¬u        0 ≤ w ≤ 1,  w = 1 − u
           AND      w = u ∧ v     0 ≤ w ≤ 1,  w ≤ u,  w ≤ v,  u + v − 1 ≤ w
           OR       w = u ∨ v     0 ≤ w ≤ 1,  u ≤ w,  v ≤ w,  w ≤ u + v

Given  an  instance  of  CIRCUIT  VALUE,  each  assignment  to  a  variable  is  translated  into 
an  equality  statement  of  the  form  Xi  =  0  or  Xi  =  1.  Similarly,  each  AND,  OR,  and  NOT 
gate  is  translated  into  a  set  of  inequalities  of  the  form  shown  above.  Logarithmic  temporary 
space  suffices  to  hold  gate  numbers  and  to  write  these  inequalities  because  the  number  of 
bits  needed  to  represent  each  gate  number  is  logarithmic  in  the  length  of  an  instance  of 
CIRCUIT  VALUE. 

To  see  that  an  instance  of  CIRCUIT  VALUE  is  a  “Yes”  instance  if  and  only  if  the  instance 
of  LINEAR  INEQUALITIES  is  also  a  “Yes”  instance,  observe  that  inputs  of  0  or  1  to  a  gate 
result  in  the  correct  output  if  and  only  if  the  corresponding  set  of  inequalities  forces  the 
output  variable  to  have  the  same  value.  By  induction  on  the  size  of  the  circuit  instance,  the 
values  computed  by  each  gate  are  exactly  the  same  as  the  values  of  the  corresponding  output 
variables  in  the  set  of  inequalities.  ■ 
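
The translation of gates into inequalities can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the circuit encoding (a dictionary listing each gate after its inputs) and the helper names `lin`, `circuit_to_inequalities`, `evaluate`, and `satisfies` are hypothetical and do not come from the text. The sketch checks that the values computed by the circuit satisfy the generated constraints; it does not solve the linear system and does not model the log-space bound.

```python
def lin(*terms):
    """Sum (coefficient, variable) pairs into one coefficient dictionary."""
    coeffs = {}
    for c, var in terms:
        coeffs[var] = coeffs.get(var, 0) + c
    return coeffs


def circuit_to_inequalities(gates, output):
    """Return constraints (coeffs, op, rhs) meaning sum(coeffs[v] * v) op rhs.

    gates maps a name to ("TRUE",), ("FALSE",), ("NOT", u), ("AND", u, v)
    or ("OR", u, v); op is "<=" or "==".
    """
    cons = []
    for w, gate in gates.items():
        kind, args = gate[0], gate[1:]
        if kind == "TRUE":
            cons.append((lin((1, w)), "==", 1))
        elif kind == "FALSE":
            cons.append((lin((1, w)), "==", 0))
        else:
            cons.append((lin((-1, w)), "<=", 0))   # 0 <= w
            cons.append((lin((1, w)), "<=", 1))    # w <= 1
            if kind == "NOT":
                (u,) = args
                cons.append((lin((1, w), (1, u)), "==", 1))           # w = 1 - u
            elif kind == "AND":
                u, v = args
                cons.append((lin((1, w), (-1, u)), "<=", 0))          # w <= u
                cons.append((lin((1, w), (-1, v)), "<=", 0))          # w <= v
                cons.append((lin((1, u), (1, v), (-1, w)), "<=", 1))  # u + v - 1 <= w
            elif kind == "OR":
                u, v = args
                cons.append((lin((1, u), (-1, w)), "<=", 0))          # u <= w
                cons.append((lin((1, v), (-1, w)), "<=", 0))          # v <= w
                cons.append((lin((1, w), (-1, u), (-1, v)), "<=", 0)) # w <= u + v
            else:
                raise ValueError(f"unknown gate type: {kind}")
    cons.append((lin((1, output)), "==", 1))       # z = 1 for the output gate
    return cons


def evaluate(gates):
    """Evaluate the circuit, assuming each gate is listed after its inputs."""
    val = {}
    for w, gate in gates.items():
        kind, args = gate[0], gate[1:]
        if kind == "TRUE":
            val[w] = 1
        elif kind == "FALSE":
            val[w] = 0
        elif kind == "NOT":
            val[w] = 1 - val[args[0]]
        elif kind == "AND":
            val[w] = val[args[0]] & val[args[1]]
        else:  # OR
            val[w] = val[args[0]] | val[args[1]]
    return val


def satisfies(cons, val):
    """Check whether the assignment val meets every constraint."""
    for coeffs, op, rhs in cons:
        lhs = sum(c * val[v] for v, c in coeffs.items())
        if (op == "==" and lhs != rhs) or (op == "<=" and lhs > rhs):
            return False
    return True
```

For a circuit whose output is 1, the gate values satisfy every constraint; choosing a gate with value 0 as the output violates the final equality z = 1.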

We  give  as  our  last  example  of  a  P-complete  problem  DTM  ACCEPTANCE,  the  problem 
of  deciding  if  a  string  is  accepted  by  a  deterministic  Turing  machine  in  a  number  of  steps 
specified  as  a  unary  number.  (The  integer  k  is  represented  as  a  unary  number  by  a  string  of  k 
characters.)  For  this  problem  it  is  more  convenient  to  give  a  direct  reduction  from  all  problems 
in  P  to  DTM  ACCEPTANCE. 

DTM  ACCEPTANCE 

Instance: A description of a DTM M, a string w, and an integer n written in unary.

Answer: “Yes” if and only if M, when started with input w, halts with the answer “Yes” in
at most n steps.

THEOREM 8.9.3 DTM ACCEPTANCE is P-complete.

Proof To show that DTM ACCEPTANCE is log-space complete for P, consider an arbitrary
problem V in P and an arbitrary instance of V, namely x. There is some Turing machine,
say M_p, that accepts instances x of V of length n in time p(n), p a polynomial. We assume
that p is included with the specification of M_p. For example, if p(y) = 2y^4 + 3y^2 + 1, we
can represent it with the string ((2, 4), (3, 2), (1, 0)). The log-space Turing machine that
translates M_p and x into an instance of DTM ACCEPTANCE writes the description of M_p
together with the input x and the value of p(n) in unary. Constant temporary space suffices
to move the descriptions of M_p and x to the output tape. To complete the proof we need
only show that O(log n) temporary space suffices to write the value of p(n) in unary, where
n is the length of x.

Since the length of the input x is provided in unary, that is, by the number of characters
it contains, its length n can be written in binary on a work tape in space O(log n) by
counting the number of characters in x. Since it is not difficult to show that any power of
a k-bit binary number can be computed by a DTM in work space O(k), it follows that any
fixed polynomial in n can be computed by a DTM in work space O(k) = O(log n). (See
Problem 8.18.)

To  show  that  DTM  ACCEPTANCE  is  in  P,  we  design  a  Turing  machine  that  accepts  the 
“Yes”  instances  in  polynomial  time.  This  machine  copies  the  unary  string  of  length  n  to  one 
of  its  work  tapes.  Given  the  description  of  the  DTM  M,  it  simulates  M  with  a  universal 
Turing  machine  on  input  w.  When  it  completes  a  step,  it  advances  the  head  on  the  work 
tape  containing  n  in  unary,  declaring  the  instance  of  DTM  ACCEPTANCE  accepted  if  M 
terminates  without  using  more  than  n  steps.  By  definition,  it  will  complete  its  simulation  of 
M  in  at  most  n  of  M’s  steps  each  of  which  uses  a  constant  number  of  steps  on  the  simulating 
machine.  That  is,  it  accepts  a  “Yes”  instance  of  DTM  ACCEPTANCE  in  time  polynomial  in 
the  length  of  the  input.  ■ 
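
The output of the reduction in this proof can be illustrated with a short Python sketch. The function names `eval_poly` and `to_dtm_acceptance`, and the use of a plain string as a stand-in for the description of M_p, are assumptions made for illustration. Python does not model the log-space restriction, so the sketch shows only what is written to the output, not how little work space is needed to write it.

```python
def eval_poly(terms, n):
    """Evaluate a polynomial given as ((coefficient, exponent), ...) at n."""
    return sum(c * n ** e for c, e in terms)


def to_dtm_acceptance(machine_description, poly_terms, x):
    """Build the instance (M_p, x, p(n) in unary), where n is the length of x."""
    n = len(x)
    steps = eval_poly(poly_terms, n)
    return (machine_description, x, "1" * steps)
```

With the polynomial ((2, 4), (3, 2), (1, 0)) from the proof and an input of length 2, the unary step bound has length 2·16 + 3·4 + 1 = 45.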


8.10  NP-Complete  Problems 

As mentioned above, the NP-complete problems are the problems in NP that are the most
difficult to solve. We have shown that NP ⊆ PSPACE ⊆ EXPTIME or that every problem in
NP, including the NP-complete problems, can be solved in exponential time. Since the
NP-complete problems are the hardest problems in NP, each of these is at worst an
exponential-time problem. Thus, we know that the NP-complete problems require either polynomial or
exponential time, but we don’t know which.

The  CIRCUIT  SAT  problem  is  to  determine  from  a  description  of  a  circuit  whether  it  can 
be  satisfied;  that  is,  whether  values  can  be  assigned  to  its  inputs  such  that  the  circuit  output 
has value 1. As mentioned above, this is our canonical NP-complete problem.

CIRCUIT  SAT 

Instance: A circuit description with n input variables {x_1, x_2, ..., x_n} for some integer n
and a designated output gate.

Answer: “Yes” if there is an assignment of values to the variables such that the output of the
circuit has value 1.

As shown in Section 3.9.6, CIRCUIT SAT is an NP-complete problem. The goal of this
problem is to recognize the “Yes” instances of CIRCUIT SAT, instances for which there are
values for the input variables such that the circuit has value 1.

In Section 3.9.6 we showed that CIRCUIT SAT described above is NP-complete by
demonstrating that for every decision problem V in NP an instance w of V and an NDTM M that
accepts “Yes” instances of V can be translated by a polynomial-time (actually, a log-space)
DTM into an instance c of CIRCUIT SAT such that w is a “Yes” instance of V if and only if c
is a “Yes” instance of CIRCUIT SAT.

Although  it  suffices  to  reduce  problems  in  NP  via  a  polynomial-time  transformation  to  an 
NP-complete problem, each of the reductions given in this chapter can be done by a log-space
transformation.  We  now  show  that  a  variety  of  other  problems  are  NP-complete. 



8.10.1  NP-Complete  Satisfiability  Problems 

In Section 3.9.6 we showed that SATISFIABILITY defined below is NP-complete. In this
section we demonstrate that two variants of this language are NP-complete by simple extensions
of the basic proof that CIRCUIT SAT is NP-complete.

SATISFIABILITY 

Instance: A set of literals X = {x_1, ¬x_1, x_2, ¬x_2, ..., x_n, ¬x_n} and a sequence of clauses
C = (c_1, c_2, ..., c_m), where each clause c_i is a subset of X.

Answer: “Yes” if there is a (satisfying) assignment of values for the variables {x_1, x_2, ...,
x_n} over the set B such that each clause has at least one literal whose value is 1.

The  two  variants  of  SATISFIABILITY  are  3-SAT,  which  has  at  most  three  literals  in  each 
clause,  and  NAESAT,  in  which  not  all  literals  in  each  clause  have  the  same  value. 

3-SAT 

Instance: A set of literals X = {x_1, ¬x_1, x_2, ¬x_2, ..., x_n, ¬x_n}, and a sequence of clauses
C = (c_1, c_2, ..., c_m), where each clause c_i is a subset of X containing at most three
literals.

Answer: “Yes” if there is an assignment of values for variables {x_1, x_2, ..., x_n} over the set
B such that each clause has at least one literal whose value is 1.

THEOREM 8.10.1 3-SAT is NP-complete.

Proof  The  proof  that  SATISFIABILITY  is  NP-complete  also  applies  to  3-SAT  because  each 
of  the  clauses  produced  in  the  transformation  of  instances  of  CIRCUIT  SAT  has  at  most  three 
literals  per  clause.  ■ 

NAESAT 

Instance:  An  instance  of  3-SAT. 

Answer:  “Yes”  if  each  clause  is  satisfiable  when  not  all  literals  have  the  same  value. 

NAESAT  contains  as  its  “Yes”  instances  those  instances  of  3-SAT  in  which  the  literals  in 
each  clause  are  not  all  equal. 

THEOREM 8.10.2 NAESAT is NP-complete.

Proof  We  reduce  CIRCUIT  SAT  to  NAESAT  using  almost  the  same  reduction  as  for  3-SAT. 
Each  gate  is  replaced  by  a  set  of  clauses.  (See  Fig.  8.11.)  The  only  difference  is  that  we 
add  the  new  literal  y  to  each  two-literal  clause  associated  with  AND  and  OR  gates  and  to 
the clause associated with the output gate. Clearly, this reduction can be performed in
deterministic log-space. Since a “Yes” instance of NAESAT can be verified in nondeterministic
polynomial  time,  NAESAT  is  in  NP.  We  now  show  that  it  is  NP-hard. 

Given  a  “Yes”  instance  of  CIRCUIT  SAT,  we  show  that  the  instance  of  3-SAT  is  a  “Yes” 
instance.  Since  every  clause  is  satisfied  in  a  “Yes”  instance  of  CIRCUIT  SAT,  every  clause  of 
the  corresponding  instance  of  NAESAT  has  at  least  one  literal  with  value  1.  The  clauses  that 
don’t  contain  the  literal  y  by  their  nature  have  not  all  literals  equal.  Those  containing  y  can 
be  made  to  satisfy  this  condition  by  setting  y  to  0,  thereby  providing  a  “Yes”  instance  of 
NAESAT. 

Now  consider  a  “Yes”  instance  of  NAESAT  produced  by  the  mapping  from  CIRCUIT 
SAT.  Replacing  every  literal  by  its  complement  generates  another  “Yes”  instance  of  NAESAT 

Step Type         Corresponding Clauses

(i READ x)        (¬g_i ∨ x)  (g_i ∨ ¬x)
(i NOT j)         (¬g_i ∨ ¬g_j)  (g_i ∨ g_j)
(i OR j k)        (g_i ∨ ¬g_j ∨ y)  (g_i ∨ ¬g_k ∨ y)  (¬g_i ∨ g_j ∨ g_k)
(i AND j k)       (¬g_i ∨ g_j ∨ y)  (¬g_i ∨ g_k ∨ y)  (g_i ∨ ¬g_j ∨ ¬g_k)
(i OUTPUT j)      (g_j ∨ y)

Figure 8.11 A reduction from CIRCUIT SAT to NAESAT is obtained by replacing each gate
in a “Yes” instance of CIRCUIT SAT by a set of clauses. The clauses used in the reduction from
CIRCUIT SAT to 3-SAT (see Section 3.9.6) are those shown above with the literal y removed. In
the reduction to NAESAT the literal y is added to the 2-literal clauses used for AND and OR gates
and to the output clause.


since  the  literals  in  each  clause  are  not  all  equal,  a  property  that  applies  before  and  after 
complementation.  In  one  of  these  “Yes”  instances  y  is  assigned  the  value  0.  Because  this  is  a 
“Yes”  instance  of  NAESAT,  at  least  one  literal  in  each  clause  has  value  1;  that  is,  each  clause 
is  satisfiable.  This  implies  that  the  original  CIRCUIT  SAT  problem  is  satisfiable.  It  follows 
that  an  instance  of  CIRCUIT  SAT  has  been  translated  into  an  instance  of  NAESAT  so  that  the 
former  is  a  “Yes”  instance  if  and  only  if  the  latter  is  a  “Yes”  instance.  ■ 
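
The clause construction of Fig. 8.11 can be tried on small circuits with the following Python sketch. It is an illustration under assumptions: the step encoding as tuples, the literal encoding as (name, polarity) pairs, and the function names are hypothetical, and satisfiability is decided by exhaustive search, which is practical only for tiny instances.

```python
from itertools import product


def lit(name, positive=True):
    """A literal is a (variable name, polarity) pair."""
    return (name, positive)


def neg(literal):
    return (literal[0], not literal[1])


def circuit_to_clauses(steps, nae=False):
    """Translate straight-line steps into clauses.

    With nae=True the extra literal y is appended to the two-literal
    AND/OR clauses and to the output clause.
    """
    y = [lit("y")] if nae else []
    clauses = []
    for step in steps:
        i, kind = step[0], step[1]
        g = lit(f"g{i}")
        if kind == "READ":
            x = lit(step[2])
            clauses += [[neg(g), x], [g, neg(x)]]
        elif kind == "NOT":
            gj = lit(f"g{step[2]}")
            clauses += [[neg(g), neg(gj)], [g, gj]]
        elif kind == "OR":
            gj, gk = lit(f"g{step[2]}"), lit(f"g{step[3]}")
            clauses += [[g, neg(gj)] + y, [g, neg(gk)] + y, [neg(g), gj, gk]]
        elif kind == "AND":
            gj, gk = lit(f"g{step[2]}"), lit(f"g{step[3]}")
            clauses += [[neg(g), gj] + y, [neg(g), gk] + y, [g, neg(gj), neg(gk)]]
        elif kind == "OUTPUT":
            clauses.append([lit(f"g{step[2]}")] + y)
        else:
            raise ValueError(f"unknown step type: {kind}")
    return clauses


def assignments(clauses):
    """Yield every truth assignment to the variables occurring in clauses."""
    names = sorted({name for clause in clauses for name, _ in clause})
    for bits in product([False, True], repeat=len(names)):
        yield dict(zip(names, bits))


def satisfiable(clauses):
    """Every clause has at least one true literal under some assignment."""
    return any(all(any(a[n] == pos for n, pos in c) for c in clauses)
               for a in assignments(clauses))


def nae_satisfiable(clauses):
    """Every clause has both a true and a false literal under some assignment."""
    return any(all(len({a[n] == pos for n, pos in c}) == 2 for c in clauses)
               for a in assignments(clauses))
```

For example, the circuit computing x1 ∧ ¬x2 yields a “Yes” instance under both translations, while the circuit computing x1 ∧ ¬x1 yields a “No” instance under both.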

8.10.2  Other  NP-Complete  Problems 

This  section  gives  a  sampling  of  additional  NP-complete  problems.  Following  the  format  of 
the  previous  section,  we  present  each  problem  and  then  give  a  proof  that  it  is  NP-complete. 
Each  proof  includes  a  reduction  of  a  problem  previously  shown  NP-complete  to  the  current 
problem.  The  succession  of  reductions  developed  in  this  book  is  shown  in  Fig.  8.12. 

INDEPENDENT  SET 

Instance: A graph G = (V, E) and an integer k.

Answer: “Yes” if there is a set of k vertices of G such that there is no edge in E between
them.

THEOREM 8.10.3 INDEPENDENT SET is NP-complete.

Proof INDEPENDENT SET is in NP because an NDTM can propose and then verify in
polynomial time a set of k independent vertices. We show that INDEPENDENT SET is
NP-hard by reducing 3-SAT to it. We begin by showing that a restricted version of 3-SAT, one
in which each clause contains exactly three literals, is also NP-complete. First, if for some
variable x both x and ¬x are in the same clause, we eliminate the clause since it is always satisfied.
Second, we replace each 2-literal clause (a ∨ b) with the two 3-literal clauses (a ∨ b ∨ z) and
(a ∨ b ∨ ¬z), where z is a new variable. Since z is either 0 or 1, if all clauses are satisfied then
(a ∨ b) has value 1 in both cases. Similarly, a clause with a single literal can be transformed
to one containing three literals by introducing two new variables and replacing the clause
containing the single literal with four clauses each containing three literals. Since adding
distinct new variables to each clause that contains fewer than three literals can be done in
log-space, this new problem, which we also call 3-SAT, is also NP-complete.

We  now  construct  an  instance  of  INDEPENDENT  SET  from  this  new  version  of  3-SAT 
in  which  k  is  equal  to  the  number  of  clauses.  (See  Fig.  8.13.)  Its  graph  G  has  one  triangle 
for  each  clause  and  vertices  carry  the  names  of  the  three  literals  in  a  clause.  G  also  has  an 
edge  between  vertices  carrying  the  labels  of  complementary  literals. 

Consider  a  “Yes”  instance  of  3-SAT.  Pick  one  literal  with  value  1  from  each  clause. 
This  identifies  k  vertices,  one  per  triangle,  and  no  edge  exists  between  these  vertices.  Thus, 
the  instance  of  INDEPENDENT  SET  is  a  “Yes”  instance.  Conversely,  a  “Yes”  instance  of 
INDEPENDENT  SET  on  G  has  k  vertices,  one  per  triangle,  and  no  two  vertices  carry  the 
label  of  a  variable  and  its  complement  because  all  such  vertices  have  an  edge  between  them. 
The  literals  associated  with  these  independent  vertices  are  assigned  value  1,  causing  each 
clause  to  be  satisfied.  Variables  not  so  identified  are  assigned  arbitrary  values.  ■ 
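
The graph built in this proof can be sketched in Python as follows. The encodings are assumptions made for illustration: a literal is a (name, polarity) pair, a vertex is a (clause index, literal) pair, each clause is assumed to contain three distinct literals, and the function names are hypothetical. The independent-set test is a plain backtracking search intended only for small instances.

```python
from itertools import combinations


def sat3_to_independent_set(clauses):
    """Build (vertices, edges, k) from clauses with exactly three literals each."""
    vertices = [(j, literal) for j, clause in enumerate(clauses) for literal in clause]
    edges = set()
    for a, b in combinations(vertices, 2):
        same_clause = a[0] == b[0]                                    # triangle edge
        complementary = a[1][0] == b[1][0] and a[1][1] != b[1][1]     # x versus its complement
        if same_clause or complementary:
            edges.add((a, b))
    return vertices, edges, len(clauses)


def has_independent_set(vertices, edges, k):
    """Backtracking search for k pairwise non-adjacent vertices."""
    def adjacent(a, b):
        return (a, b) in edges or (b, a) in edges

    def extend(chosen, start):
        if len(chosen) == k:
            return True
        for idx in range(start, len(vertices)):
            v = vertices[idx]
            if all(not adjacent(v, u) for u in chosen):
                if extend(chosen + [v], idx + 1):
                    return True
        return False

    return extend([], 0)
```

A satisfiable set of clauses produces a graph with an independent set of size k, one vertex per triangle; the eight clauses containing every sign pattern over three variables are unsatisfiable and produce a graph with no such set.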


Figure  8.13  A  graph  for  an  instance  of  INDEPENDENT  SET  constructed  from  the  following 
instance  of  3-SAT:  (x\  V  x2  V  x3)  A  ( X\  Vs2V  x})  A  (xt  V  x2  V  x3). 

3-COLORING

Instance: The description of a graph G = (V, E).

Answer: “Yes” if there is an assignment of three colors to vertices such that adjacent vertices
have different colors.

THEOREM 8.10.4 3-COLORING is NP-complete.

Proof  To  show  that  3-COLORING  is  in  NP,  observe  that  a  three-coloring  of  a  graph  can 
be  proposed  in  nondeterministic  polynomial  time  and  verified  in  deterministic  polynomial 
time. 

We reduce NAESAT to 3-COLORING. Recall that an instance of NAESAT is an instance
of 3-SAT. A “Yes” instance of NAESAT is one for which each clause is satisfiable with not
all literals equal. Let an instance of NAESAT consist of m clauses C = (c_1, c_2, ..., c_m)
containing exactly three literals from the set X = {x_1, ¬x_1, x_2, ¬x_2, ..., x_n, ¬x_n} of literals in
n variables. (Use the technique introduced in the proof of Theorem 8.10.3 to ensure that
each clause in an instance of 3-SAT has exactly three literals per clause.)

Given  an  instance  of  NAESAT,  we  construct  a  graph  G  in  log-space  and  show  that  this 
graph  is  three-colorable  if  and  only  if  the  instance  of  NAESAT  is  a  “Yes”  instance. 

The graph G has a set of n variable triangles, one per variable. The vertices of the
triangle associated with variable x_i are {v, x_i, ¬x_i}. (See Fig. 8.14.) Thus, all the variable
triangles have one vertex in common. For each clause containing three literals we construct
one clause triangle. If clause c_j contains literals λ_j1, λ_j2, and λ_j3, its associated
clause triangle has vertices labeled (j, λ_j1), (j, λ_j2), and (j, λ_j3). Finally, we add an edge
between the vertex (j, λ_jk) and the vertex associated with the literal λ_jk.

We  now  show  that  an  instance  of  NAESAT  is  a  “Yes”  instance  if  and  only  if  the  graph  G 
is  three-colorable.  Suppose  the  graph  is  three-colorable  and  the  colors  are  {0, 1,2}.  Since 


Figure  8.14  A  graph  G  corresponding  to  the  clauses  C\  =  {xi,Xi,Xi}  and  C2  =  {x\,X2,x$} 
in  an  instance  of  NAESAT.  It  has  one  variable  triangle  for  each  variable  and  one  clause  triangle  for 
each  clause. 

three colors are needed to color the vertices of a triangle and the variable triangles have a
vertex labeled v in common, assume without loss of generality that this common vertex has
color 2. The other two vertices in each variable triangle are assigned colors 0 and 1, values we
give to the associated variable and its complement.

Consider now the coloring of clause triangles. Since three colors are needed to color
vertices of a clause triangle, consider vertices with colors 0 and 1. The edges between these
clause vertices and the corresponding vertices in variable triangles have different colors at
each end. Let the literals in the clause triangles be given values that are the Boolean
complement of their colors. This provides values for literals that are consistent with the values of
variables and ensures that not all literals in a clause have the same value. The third vertex in
each triangle has color 2. Give its literal a value consistent with the value of its variable. It
follows that the clauses are a “Yes” instance of NAESAT.

Suppose, on the other hand, that a set of clauses is a “Yes” instance of NAESAT. We
show that the graph G is three-colorable. Assign color 2 to vertex v and colors 0 and 1 to
vertices labeled x_i and ¬x_i based on the values of these literals in the “Yes” instance. Consider
two literals in clause c_j that have different values. If x_i (¬x_i) is one of these, give the vertex
labeled (j, x_i) ((j, ¬x_i)) the color that is the Boolean complement of the color of x_i (¬x_i) in
its variable triangle. Do the same for the other literal. Since the third literal has the same
value as one of the other two literals (they have different values), let its vertex have color 2.
Then G is three-colorable. Thus, G is a “Yes” instance of 3-COLORING if and only if the
corresponding set of clauses is a “Yes” instance of NAESAT. ■
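
The construction of G from an instance of NAESAT can be sketched in Python as shown below. The vertex encodings are assumptions made for illustration: the common vertex is the string "v", a literal vertex is a (name, polarity) pair with a string name, and a clause vertex is a (clause index, literal) pair; each clause is assumed to hold three distinct literals. The function names are hypothetical, and three-colorability is tested by simple backtracking, suitable only for small graphs.

```python
def naesat_to_graph(variables, clauses):
    """Return the edge set of the graph G built from an NAESAT instance."""
    edges = set()
    for x in variables:
        pos, neg = (x, True), (x, False)
        # Variable triangle on {v, x, complement of x}.
        edges |= {("v", pos), ("v", neg), (pos, neg)}
    for j, clause in enumerate(clauses):
        corners = [(j, literal) for literal in clause]
        # Clause triangle.
        edges |= {(corners[0], corners[1]),
                  (corners[1], corners[2]),
                  (corners[0], corners[2])}
        # Edge from each clause vertex to the vertex of its literal.
        for literal in clause:
            edges.add(((j, literal), literal))
    return edges


def three_colorable(edges):
    """Backtracking test for a proper coloring with colors 0, 1, 2."""
    vertices = sorted({u for edge in edges for u in edge}, key=repr)
    neighbors = {u: set() for u in vertices}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    color = {}

    def assign(idx):
        if idx == len(vertices):
            return True
        u = vertices[idx]
        for c in range(3):
            if all(color.get(w) != c for w in neighbors[u]):
                color[u] = c
                if assign(idx + 1):
                    return True
                del color[u]
        return False

    return assign(0)
```

A single clause over three variables is a “Yes” instance of NAESAT and gives a three-colorable graph. The four clauses (x1, ±x2, ±x3) with all sign patterns on x2 and x3 admit no not-all-equal assignment, and the corresponding graph is not three-colorable.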

EXACT  COVER 

Instance: A set S = {u_1, u_2, ..., u_p} and a family {S_1, S_2, ..., S_n} of subsets of S.

Answer: “Yes” if there are disjoint subsets S_{j_1}, S_{j_2}, ..., S_{j_t} such that ∪_{1≤i≤t} S_{j_i} = S.

THEOREM 8.10.5 EXACT COVER is NP-complete.

Proof  It  is  straightforward  to  show  that  EXACT  COVER  is  in  NP.  An  NDTM  can  simply 
select  the  subsets  and  then  verify  in  time  polynomial  in  the  length  of  the  input  that  these 
subsets  are  disjoint  and  that  they  cover  the  set  S. 

We  now  give  a  log-space  reduction  from  3-COLORING  to  EXACT  COVER.  Given  an 
instance  of  3-COLORING,  that  is,  a  graph  G  =  (V,  E ),  we  construct  an  instance  of  EXACT 
COVER,  namely,  a  set  S  and  a  family  of  subsets  of  S  such  that  G  is  a  “Yes”  instance  of 
3-COLORING  if  and  only  if  the  family  of  sets  is  a  “Yes”  instance  of  EXACT  COVER. 

As the set S we choose S = V ∪ {⟨e, i⟩ | e ∈ E, 0 ≤ i ≤ 2} and as the family
of subsets of S we choose the sets S_{v,i} and R_{e,i} defined below for v ∈ V, e ∈ E and
0 ≤ i ≤ 2:

S_{v,i} = {v} ∪ {⟨e, i⟩ | e is incident on v ∈ V}

R_{e,i} = {⟨e, i⟩}

Let G be three-colorable. Then let c_v, an integer in {0, 1, 2}, be the color of vertex v.
We show that the subsets S_{v,c_v} for v ∈ V and R_{e,i} for ⟨e, i⟩ ∉ S_{v,c_v} for any v ∈ V
are an exact cover. If e = (v, w) ∈ E, then c_v ≠ c_w and S_{v,c_v} and S_{w,c_w} are disjoint. By
definition the sets R_{e,i} are disjoint from the other sets. Furthermore, every element of S is
in one of these sets.

On the other hand, suppose that S has an exact cover. Then, for each v ∈ V, there is a
unique c_v, 0 ≤ c_v ≤ 2, such that v ∈ S_{v,c_v}. To show that G has a three-coloring, assume
that it doesn’t and establish a contradiction. Since G doesn’t have a three-coloring, there is
an edge e = (v, w) such that c_v = c_w. Then S_{v,c_v} and S_{w,c_w} both contain ⟨e, c_v⟩, which
contradicts the assumption that S has an exact cover. It follows that G has a three-coloring
if and only if S has an exact cover. ■
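
The sets S_{v,i} and R_{e,i} can be generated mechanically, as in the following Python sketch. The naming of family members by tuples such as ("S", v, i), the representation of an edge as a pair of vertex names, and the function names are assumptions made for illustration; vertex names are assumed to be strings so that they cannot collide with the pairs ⟨e, i⟩.

```python
def coloring_to_exact_cover(vertices, edges):
    """Return (universe, family) for the graph given by vertices and edges."""
    universe = set(vertices) | {(e, i) for e in edges for i in range(3)}
    family = {}
    for v in vertices:
        for i in range(3):
            # S_{v,i}: the vertex v plus <e, i> for every edge e incident on v.
            family[("S", v, i)] = frozenset({v} | {(e, i) for e in edges if v in e})
    for e in edges:
        for i in range(3):
            # R_{e,i}: the singleton {<e, i>}.
            family[("R", e, i)] = frozenset({(e, i)})
    return universe, family


def cover_from_coloring(vertices, edges, color):
    """Select S_{v,c_v} for each vertex and the R_{e,i} for uncovered pairs."""
    chosen = [("S", v, color[v]) for v in vertices]
    used = {(e, color[v]) for v in vertices for e in edges if v in e}
    chosen += [("R", e, i) for e in edges for i in range(3) if (e, i) not in used]
    return chosen


def is_exact_cover(universe, family, chosen):
    """True if the chosen sets are pairwise disjoint and their union is the universe."""
    seen = []
    for name in chosen:
        seen.extend(family[name])
    return len(seen) == len(set(seen)) and set(seen) == universe
```

On a triangle, a proper coloring yields an exact cover, whereas giving two adjacent vertices the same color makes the two corresponding sets overlap.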

SUBSET  SUM 

Instance: A set Q = {a_1, a_2, ..., a_n} of positive integers and a positive integer d.

Answer: “Yes” if there is a subset of Q that adds to d.

THEOREM 8.10.6 SUBSET SUM is NP-complete.

Proof  SUBSET  SUM  is  in  NP  because  a  subset  can  be  nondeterministically  chosen  in  time 
equal  to  n  and  an  accepting  choice  verified  in  a  polynomial  number  of  steps  by  adding  up 
the  chosen  elements  of  the  subset  and  comparing  the  result  to  d. 

To show that SUBSET SUM is NP-hard, we give a log-space reduction of EXACT COVER
to it. Given an instance of EXACT COVER, namely, a set S = {u_1, u_2, ..., u_p} and a family
{S_1, S_2, ..., S_n} of subsets of S, we construct the instance of SUBSET SUM characterized
as follows. We let β = n + 1 and d = β^(p−1) + β^(p−2) + ... + β^0 = (β^p − 1)/(β − 1). We
represent the element u_i ∈ S by the integer β^(i−1), 1 ≤ i ≤ p, and represent the set S_j by
the integer a_j that is the sum of the integers associated with the elements contained in S_j.
For example, if p = n = 3, S_1 = {u_1, u_3}, S_2 = {u_1, u_2}, and S_3 = {u_2}, we represent
S_1 by a_1 = β^2 + β^0, S_2 by a_2 = β + β^0, and S_3 by a_3 = β. Since S_1 and S_3 form an
exact cover of S, a_1 + a_3 = β^2 + β + 1 = d.

Thus, given an instance of EXACT COVER, this polynomial-time transformation
produces an instance of SUBSET SUM. We now show that the instance of the former is a “Yes”
instance if and only if the instance of the latter is a “Yes” instance. To see this, observe that
in adding the integers corresponding to a collection of the sets S_j in base β there is no
carry from one power of β to the next, because each element appears in at most n sets and
β = n + 1. Thus the coefficient of β^k is exactly the number of times that u_{k+1} appears in
the chosen subsets of S. The subsets form a “Yes” instance of EXACT COVER exactly when
the corresponding integers contain each power of β exactly once, that is, when the integers
sum to d. ■
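
The base-β encoding can be checked numerically with a short Python sketch. The function names and the representation of each subset as a set of element indices between 1 and p are assumptions made for illustration; the target d is taken here to have one base-β digit for each of the p elements of S. The brute-force `subset_sum` is included only to test small instances.

```python
from itertools import combinations


def exact_cover_to_subset_sum(p, subsets):
    """Map an EXACT COVER instance to (a, d).

    p is the number of elements u_1, ..., u_p; subsets is a list of sets
    of indices in 1..p. Element u_i contributes beta**(i - 1).
    """
    n = len(subsets)
    beta = n + 1                                   # base large enough to avoid carries
    d = sum(beta ** i for i in range(p))           # one digit 1 per element
    a = [sum(beta ** (i - 1) for i in s) for s in subsets]
    return a, d


def subset_sum(a, d):
    """Exhaustively test whether some selection of entries of a adds to d."""
    return any(sum(choice) == d
               for r in range(len(a) + 1)
               for choice in combinations(a, r))
```

For the example in the proof (p = n = 3, β = 4) the integers are a_1 = 17, a_2 = 5, a_3 = 4 and d = 21, and a_1 + a_3 = d.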

TASK  SEQUENCING 

Instance: Positive integers t_1, t_2, ..., t_r, which are execution times, d_1, d_2, ..., d_r, which
are deadlines, p_1, p_2, ..., p_r, which are penalties, and an integer k ≥ 1.

Answer: “Yes” if there is a permutation π of {1, 2, ..., r} such that

    Σ_{j=1}^{r} [if t_π(1) + t_π(2) + ... + t_π(j) > d_π(j) then p_π(j) else 0] ≤ k

THEOREM 8.10.7 TASK SEQUENCING is NP-complete.

Proof TASK SEQUENCING is in NP because a permutation π for a “Yes” instance can be
verified as a satisfying permutation in polynomial time. We now give a log-space reduction
of SUBSET SUM to TASK SEQUENCING.

An instance of SUBSET SUM is a positive integer $d$ and a set $Q = \{a_1, a_2, \ldots, a_n\}$ of
positive integers. A “Yes” instance is one such that a subset of $Q$ adds to $d$. We translate
an instance of SUBSET SUM to an instance of TASK SEQUENCING by setting $r = n$,
$t_i = p_i = a_i$, $d_i = d$, and $k = (\sum_i a_i) - d$. Consider a “Yes” instance of this TASK
SEQUENCING problem. Then the following inequality holds:

$$\left(\sum_{j=1}^{r} \left[\text{if } a_{\pi(1)} + a_{\pi(2)} + \cdots + a_{\pi(j)} > d \text{ then } a_{\pi(j)} \text{ else } 0\right]\right) \leq k$$

Let $q$ be the expression in parentheses in the above inequality. Then $q = a_{\pi(l+1)} + a_{\pi(l+2)}
+ \cdots + a_{\pi(n)}$, where $l$ is the integer for which $p = a_{\pi(1)} + a_{\pi(2)} + \cdots + a_{\pi(l)} \leq d$ and
$p + a_{\pi(l+1)} > d$. By definition $p + q = \sum_i a_i$. It follows that $q \geq \sum_i a_i - d$. Since
$q \leq k = \sum_i a_i - d$, we conclude that $p = d$ or that the instance of TASK SEQUENCING
corresponds to a “Yes” instance of SUBSET SUM. Similarly, consider a “Yes” instance of
SUBSET SUM. It follows from the above argument that there is a permutation such that the
instance of TASK SEQUENCING is a “Yes” instance. ■
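
The translation used in this proof can be checked on small inputs. The following is a minimal sketch, not part of the original proof; the function names are hypothetical, and both decision procedures are exponential-time brute force intended only for tiny instances.

```python
from itertools import permutations

def subset_sum_to_task_sequencing(a, d):
    """Translate a SUBSET SUM instance (a, d) as in the proof above."""
    t = list(a)                # execution times t_i = a_i
    p = list(a)                # penalties p_i = a_i
    deadlines = [d] * len(a)   # every task has the same deadline d
    k = sum(a) - d             # penalty bound k = (sum of a_i) - d
    return t, deadlines, p, k

def total_penalty(order, t, deadlines, p):
    """Sum the penalties of the tasks that finish after their deadlines."""
    elapsed, penalty = 0, 0
    for j in order:
        elapsed += t[j]
        if elapsed > deadlines[j]:
            penalty += p[j]
    return penalty

def task_sequencing_yes(t, deadlines, p, k):
    """Brute force over all permutations (assumption: tiny inputs only)."""
    return any(total_penalty(order, t, deadlines, p) <= k
               for order in permutations(range(len(t))))

def subset_sum_yes(a, d):
    """Brute force over all subsets (assumption: tiny inputs only)."""
    n = len(a)
    return any(sum(a[i] for i in range(n) if (mask >> i) & 1) == d
               for mask in range(1 << n))
```

On every small instance tried, the two brute-force answers agree, which is what the theorem predicts.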


The  following  NP-complete  problem  is  closely  related  to  the  P-complete  problem  LINEAR 
INEQUALITIES.  The  difference  is  that  the  vector  x  must  be  a  0-1  vector  in  the  case  of  0-1 
INTEGER  PROGRAMMING,  whereas  in  LINEAR  INEQUALITIES  it  can  be  a  vector  of  rationals. 
Thus,  changing  merely  the  conditions  on  the  vector  x  elevates  the  problem  from  P  to  NP  and 
makes  it  NP-complete. 

0-1  INTEGER  PROGRAMMING 

Instance: An $n \times m$ matrix $A$ and a column $n$-vector $b$, both over the ring of integers for
integers $n$ and $m$.

Answer: “Yes” if there is a column $m$-vector $x$ over the set $\{0, 1\}$ such that $Ax = b$.

THEOREM 8.10.8 0-1 INTEGER PROGRAMMING is NP-complete.

Proof To show that 0-1 INTEGER PROGRAMMING is in NP, we note that a 0-1 vector $x$
can be chosen nondeterministically in $n$ steps, after which verification that it is a solution to
the problem can be done in $O(n^2)$ steps on a RAM and $O(n^4)$ steps on a DTM.

To show that 0-1 INTEGER PROGRAMMING is NP-hard we give a log-space reduction
of 3-SAT to it. Given an instance of 3-SAT, namely, a set of literals $X = \{x_1,
\bar{x}_1, x_2, \bar{x}_2, \ldots, x_n, \bar{x}_n\}$ and a sequence of clauses $C = (c_1, c_2, \ldots, c_m)$, where each clause $c_i$
is a subset of $X$ containing at most three literals, we construct an $m \times p$ matrix $A = [B \mid C]$,
where $B = [b_{i,j}]$ for $1 \leq i \leq m$ and $1 \leq j \leq n$, and $C = [c_{r,s}]$ for $1 \leq r \leq m$ and $1 \leq s \leq mn$. We
also construct a column $m$-vector $d$ as shown below, where $p = (m + 1)n$. The entries of $B$
and $C$ are defined below.

$$b_{i,j} = \begin{cases} 1 & \text{if } x_j \in c_i \text{ for } 1 \leq j \leq n \\ -1 & \text{if } \bar{x}_j \in c_i \text{ for } 1 \leq j \leq n \\ 0 & \text{otherwise} \end{cases}
\qquad
c_{r,s} = \begin{cases} -1 & \text{if } (r-1)n + 1 \leq s \leq rn \\ 0 & \text{otherwise} \end{cases}$$

Since no one clause contains both $x_j$ and $\bar{x}_j$, this definition of $b_{i,j}$ is consistent.

We also let $d_i$, the $i$th component of $d$, satisfy $d_i = 1 - q_i$, where $q_i$ is the number of
complemented variables in $c_i$. Thus, the matrix $A$ has the form given below, where $B$ is an
$m \times n$ matrix and each row of $A$ contains $n$ instances of $-1$ outside of $B$ in non-overlapping
columns.

$$A = \left[\begin{array}{c|cccc} & -1 \cdots -1 & 0 \cdots 0 & \cdots & 0 \cdots 0 \\ B & 0 \cdots 0 & -1 \cdots -1 & \cdots & 0 \cdots 0 \\ & \vdots & \vdots & \ddots & \vdots \\ & 0 \cdots 0 & 0 \cdots 0 & \cdots & -1 \cdots -1 \end{array}\right]$$

We show that the instance of 3-SAT is a “Yes” instance if and only if this instance of 0-1
INTEGER PROGRAMMING is a “Yes” instance, that is, if and only if $Ax = d$.

We write the column $p$-vector $x$ as the concatenation of the column $n$-vector $u$ and
the column $mn$-vector $v$. Because the entries of $v$ act as slack variables that can only lower
each component, $Ax = d$ for some $v$ if and only if $Bu \geq d$. Now consider
the $i$th component of $Bu$. Let $u$ select $k_i$ uncomplemented and $l_i$ complemented variables
of clause $c_i$. Then $Bu \geq d$ if and only if $k_i - l_i \geq d_i = 1 - q_i$ or $k_i + (q_i - l_i) \geq 1$
for all $i$. Now let $x_i = u_i$ for $1 \leq i \leq n$. Then $k_i$ and $q_i - l_i$ are the numbers of
uncomplemented and complemented variables in $c_i$ that are set to 1 and 0, respectively.
Since $k_i + (q_i - l_i) \geq 1$, $c_i$ is satisfied, as are all clauses, giving us the desired result. ■
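
As an illustration of this construction, the sketch below builds $A = [B \mid C]$ and $d$ from a small 3-SAT instance and compares brute-force answers for the two problems. It is a sketch under stated assumptions, not the reduction's log-space implementation: clauses are encoded as lists of nonzero integers ($j$ for $x_j$, $-j$ for $\bar{x}_j$), no clause contains a variable and its complement, and all function names are hypothetical.

```python
from itertools import product

def three_sat_to_ip(n, clauses):
    """Build the m x (n + m*n) matrix A = [B | C] and the m-vector d."""
    m = len(clauses)
    A, d = [], []
    for r, clause in enumerate(clauses):
        row = [0] * (n + m * n)
        for lit in clause:
            row[abs(lit) - 1] = 1 if lit > 0 else -1   # entries of B
        for s in range(r * n, (r + 1) * n):
            row[n + s] = -1                            # entries of C (slack columns)
        A.append(row)
        q = sum(1 for lit in clause if lit < 0)        # complemented variables in clause
        d.append(1 - q)
    return A, d

def ip_yes(A, d):
    """Brute force: is there a 0-1 vector x with Ax = d? (tiny inputs only)"""
    cols = len(A[0])
    return any(all(sum(a * x for a, x in zip(row, xs)) == di
                   for row, di in zip(A, d))
               for xs in product((0, 1), repeat=cols))

def sat_yes(n, clauses):
    """Brute force: is there an assignment satisfying every clause?"""
    return any(all(any((lit > 0) == bool(bits[abs(lit) - 1]) for lit in clause)
                   for clause in clauses)
               for bits in product((0, 1), repeat=n))
```

The brute-force search over all $2^p$ vectors is only feasible for a handful of variables and clauses; it serves to confirm the equivalence on examples, not to solve instances.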

8.11  The  Boundary  Between  P  and  NP 

It  is  important  to  understand  where  the  boundary  lies  between  problems  in  P  and  the  NP- 
complete  problems.  While  this  topic  is  wide  open,  we  shed  a  modest  amount  of  light  on  it  by 
showing  that  2-SAT,  the  version  of  3-SAT  in  which  each  clause  has  at  most  two  literals,  lies  on 
the  P-side  of  this  boundary,  as  shown  below.  In  fact,  it  is  in  NL,  which  is  in  P. 

THEOREM 8.11.1 2-SAT is in NL.

Proof Given an instance $I$ of 2-SAT, we first ensure that each clause has exactly two distinct
literals by adding to each one-literal clause a new literal $z$ that is not used elsewhere. We
then construct a directed graph $G = (V, E)$ with vertices $V$ labeled by the literals $x$ and $\bar{x}$
for each variable $x$ appearing in $I$. This graph has an edge $(\alpha, \beta)$ in $E$ directed from vertex
$\alpha$ to vertex $\beta$ if the clause $(\bar{\alpha} \vee \beta)$ is in $I$. If $(\bar{\alpha} \vee \beta)$ is in $I$, so is $(\beta \vee \bar{\alpha})$ because of
commutativity of $\vee$. Thus, if $(\alpha, \beta) \in E$, then $(\bar{\beta}, \bar{\alpha}) \in E$ also. (See Fig. 8.15.) Note
that $(\alpha, \beta) \neq (\bar{\beta}, \bar{\alpha})$ because this requires that $\beta = \bar{\alpha}$, which is not allowed. Let $\alpha \neq \gamma$.
It follows that if there is a path from $\alpha$ to $\gamma$ in $G$, there is a distinct path from $\bar{\gamma}$ to $\bar{\alpha}$
obtained by reversing the directions of each edge on the path and replacing the literals by
their complements.

To understand why these edges are chosen, note that if all clauses of $I$ are satisfied and
$(\bar{\alpha} \vee \beta)$ is in $I$, then $\alpha = 1$ implies that $\beta = 1$. This implication relation, denoted $\alpha \Rightarrow \beta$,
is transitive. If there is a path $(\alpha_1, \alpha_2, \ldots, \alpha_k)$ in $G$, then there are clauses $(\bar{\alpha}_1 \vee \alpha_2)$,
$(\bar{\alpha}_2 \vee \alpha_3), \ldots, (\bar{\alpha}_{k-1} \vee \alpha_k)$ in $I$. If all clauses are satisfied and if the literal $\alpha_1 = 1$, then
each un-negated literal on this path must have value 1.

We now show that an instance $I$ is a “No” instance if and only if there is a variable $x$
such that there is a path in $G$ from $x$ to $\bar{x}$ and one from $\bar{x}$ to $x$.

If there is a variable $x$ such that such paths exist, this means that $x \Rightarrow \bar{x}$ and $\bar{x} \Rightarrow x$,
which is a logical contradiction. This implies that the instance $I$ is a “No” instance.

Conversely, suppose $I$ is a “No” instance. To prove there is a variable $x$ such that there
are paths from vertex $x$ to vertex $\bar{x}$ and from $\bar{x}$ to $x$, assume that for no variable $x$ does this
condition hold and show that $I$ is a “Yes” instance, that is, every clause is satisfied, which
contradicts the assumption that $I$ is a “No” instance.

Figure 8.15 A graph capturing the implications associated with the following satisfiable instance
of 2-SAT: (xi V x2 ) A (x3 V X\) A (x3 V x2) A ( X\ V x2) A (*3 V *1).

Identify a variable that has not been assigned a value and let $\alpha$ be one of the two
corresponding literals such that there is no directed path in $G$ from the vertex $\alpha$ to $\bar{\alpha}$. (By
assumption, this must hold for at least one of the two literals associated with $x$.) Assign
value 1 to $\alpha$ and each literal $\lambda$ reachable from it. (This assigns values to the variables
identified by these literals.) If these assignments can be made without assigning a variable both
values 0 and 1, each clause can be satisfied and $I$ is a “Yes” instance rather than a “No” one, as
assumed. To show that each variable is assigned a single value, we assume the contrary and
show that the conditions under which values are assigned to variables by this procedure are
contradicted. A variable can be assigned contradictory values in two ways: a) on the current
step the literals $\lambda$ and $\bar{\lambda}$ are both reachable from $\alpha$ and assigned value 1, and b) a literal $\lambda$
is reachable from $\alpha$ on the current step that was assigned value 0 on a previous step. For
the first case to happen, there must be a path from $\alpha$ to vertices $\lambda$ and $\bar{\lambda}$. By design of the
graph, if there is a path from $\alpha$ to $\lambda$, there is a path from $\bar{\lambda}$ to $\bar{\alpha}$. Since there is a path from
$\alpha$ to $\bar{\lambda}$, there must be a path from $\alpha$ to $\bar{\alpha}$, contradicting the assumption that there are no
such paths. In the second case, let a literal $\lambda$ be assigned 1 on the current step that was assigned 0
on a previous step. It follows that $\bar{\lambda}$ was given value 1 on that step. Because there is a path
from $\alpha$ to $\lambda$, there is one from $\bar{\lambda}$ to $\bar{\alpha}$ and our procedure, which assigned $\bar{\lambda}$ value 1 on the
earlier step, must have assigned $\bar{\alpha}$ value 1 on that step also. Thus, $\alpha$ had the value 0 before
the current step, contradicting the assumption that it was not assigned a value.

To show that 2-SAT is in NL, recall that NL is closed under complements. Thus, it
suffices to show that “No” instances of 2-SAT can be accepted in nondeterministic logarithmic
space. By the above argument, if $I$ is a “No” instance, there is a variable $x$ such that there is
a path in $G$ from $x$ to $\bar{x}$ and from $\bar{x}$ to $x$. Since the number of vertices in $G$ is at most linear
in $n$, the length of $I$ (it may be as small as $O(\sqrt{n})$), an NDTM can propose and then verify
in space $O(\log n)$ a path in $G$ from $x$ to $\bar{x}$ and back by checking that the putative edges are
edges of $G$, that $x$ is the first and last vertex on the path, and that $\bar{x}$ is encountered before
the end of the path. ■
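
The path criterion in this proof can be illustrated with a short deterministic sketch. It is an illustration only: it uses ordinary depth-first reachability rather than the nondeterministic log-space procedure of the proof, it assumes every clause already has exactly two literals, and the encoding ($j$ for $x_j$, $-j$ for $\bar{x}_j$) and function names are hypothetical.

```python
from itertools import product

def two_sat_is_no_instance(n, clauses):
    """True iff some variable x has paths x -> not-x and not-x -> x in G."""
    edges = {lit: set() for j in range(1, n + 1) for lit in (j, -j)}
    for a, b in clauses:
        edges[-a].add(b)   # clause (a or b) yields the edge not-a -> b
        edges[-b].add(a)   # and the mirrored edge not-b -> a

    def reachable(src, dst):
        seen, stack = {src}, [src]
        while stack:
            v = stack.pop()
            if v == dst:
                return True
            for w in edges[v] - seen:
                seen.add(w)
                stack.append(w)
        return False

    return any(reachable(x, -x) and reachable(-x, x) for x in range(1, n + 1))

def brute_force_satisfiable(n, clauses):
    """Reference check by trying all assignments (tiny inputs only)."""
    return any(all(any((lit > 0) == bool(bits[abs(lit) - 1]) for lit in clause)
                   for clause in clauses)
               for bits in product((0, 1), repeat=n))
```

Comparing `two_sat_is_no_instance` with the negation of `brute_force_satisfiable` on small inputs gives a quick sanity check of the "if and only if" statement.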

8.12 PSPACE-Complete Problems

PSPACE is the class of decision problems that are decidable by a Turing machine in space
polynomial in the length of the input. Problems in PSPACE are potentially much more complex
than problems in P.

The hardest problems in PSPACE are the PSPACE-complete problems. (See Section 8.8.)
Such problems have two properties: a) they are in PSPACE and b) every problem in PSPACE
can be reduced to them by a polynomial-time Turing machine. The PSPACE-complete
problems are the hardest problems in PSPACE in the sense that if they are in P, then so are all
problems in PSPACE, an unlikely prospect.

We  now  establish  that  QUANTIFIED  SATISFIABILITY  defined  below  is  PSPACE-complete. 
We  also  show  that  GENERALIZED  GEOGRAPHY,  a  game  played  on  a  graph,  is  PSPACE- 
complete  by  reducing  QUANTIFIED  SATISFIABILITY  to  it.  A  characteristic  shared  by  many 
important  PSPACE-complete  problems  and  these  two  problems  is  that  they  are  equivalent  to 
games  on  graphs. 

8.12.1 A First PSPACE-Complete Problem

Quantified Boolean formulas use existential quantification, denoted $\exists$, and universal
quantification, denoted $\forall$. Existential quantification on variable $x_1$, denoted $\exists x_1$, means “there
exists a value for the Boolean variable $x_1$,” whereas universal quantification on variable $x_2$,
denoted $\forall x_2$, means “for all values of the Boolean variable $x_2$.” Given a Boolean formula such
as $(x_1 \vee x_2 \vee \bar{x}_3) \wedge (\bar{x}_1 \vee \bar{x}_2 \vee \bar{x}_3) \wedge (\bar{x}_1 \vee x_2 \vee x_3)$, a quantification of it is a collection of
universal or existential quantifiers, one per variable in the formula, followed by the formula.
For example,

$$\forall x_1 \exists x_2 \forall x_3 [(x_1 \vee x_2 \vee \bar{x}_3) \wedge (\bar{x}_1 \vee \bar{x}_2 \vee \bar{x}_3) \wedge (\bar{x}_1 \vee x_2 \vee x_3)]$$

is a quantified formula. Its meaning is “for all values of $x_1$, does there exist a value for $x_2$ such
that for all values of $x_3$ the formula $(x_1 \vee x_2 \vee \bar{x}_3) \wedge (\bar{x}_1 \vee \bar{x}_2 \vee \bar{x}_3) \wedge (\bar{x}_1 \vee x_2 \vee x_3)$ is satisfied?”
In this case the answer is “No” because for $x_1 = 1$, the function is not satisfied with $x_3 = 0$
when $x_2 = 0$ and is not satisfied with $x_3 = 1$ when $x_2 = 1$. However, if the third quantifier
is changed from universal to existential, then the quantified formula is satisfied. Note that the
order of the quantifiers is important. To see this, observe that under the quantification order
$\forall x_1 \forall x_3 \exists x_2$ the quantified formula is satisfied.

QUANTIFIED SATISFIABILITY consists of satisfiable instances of quantified Boolean
formulas in which each formula is expressed as a set of clauses.

QUANTIFIED SATISFIABILITY

Instance: A set of literals $X = \{x_1, \bar{x}_1, x_2, \bar{x}_2, \ldots, x_n, \bar{x}_n\}$, a sequence of clauses $C =
(c_1, c_2, \ldots, c_m)$, where each clause $c_i$ is a subset of $X$, and a sequence of quantifiers
$(Q_1, Q_2, \ldots, Q_n)$, where $Q_j \in \{\forall, \exists\}$.

Answer: “Yes” if under the quantifiers $Q_1 x_1 Q_2 x_2 \cdots Q_n x_n$, the clauses $c_1, c_2, \ldots, c_m$ are
satisfied, denoted

$$Q_1 x_1 Q_2 x_2 \cdots Q_n x_n \, [\phi]$$

where the formula $\phi = c_1 \wedge c_2 \wedge \cdots \wedge c_m$ is in the product-of-sums form. (See Section 2.2.)

In this section we establish the following result, stronger than PSPACE-completeness of
QUANTIFIED SATISFIABILITY: we show it is complete for PSPACE under log-space
transformations. Reductions of this type are potentially stronger than polynomial-time reductions
because the transformation is executed in logarithmic space, not polynomial time. While it
is true that every log-space transformation is a polynomial-time transformation (see
Theorem 8.5.8), it is not known if the reverse is true. We prove this result in two stages: we first
show that QUANTIFIED SATISFIABILITY is in PSPACE and then that it is hard for PSPACE.

LEMMA 8.12.1 QUANTIFIED SATISFIABILITY is in PSPACE.

Proof To show that QUANTIFIED SATISFIABILITY is in PSPACE we evaluate in
polynomial space a circuit, $C_{QSAT}$, whose value is 1 if and only if the instance of QUANTIFIED
SATISFIABILITY is a “Yes” instance. The circuit $C_{QSAT}$ is a tree all of whose paths from the
inputs to the output (root of the tree) have the same length, each vertex is either an AND
gate or an OR gate, and each input has value 0 or 1. (See Fig. 8.16.) The gate at the root of
the tree is associated with the variable $x_1$, the gates at the next level are associated with $x_2$,
etc. The type of gate at the $j$th level is determined by the $j$th quantifier $Q_j$ and is AND if
$Q_j = \forall$ and OR if $Q_j = \exists$. The leaves correspond to all $2^n$ values of the $n$ variables:
at each level of the tree the left and right branches correspond to the values 0 and 1 for the
corresponding quantified variable. Each leaf of the tree contains the value of the formula $\phi$
for the values of the variables leading to that leaf. In the example of Fig. 8.16 the leftmost
leaf has value 1 because on input $x_1 = x_2 = x_3 = 0$ each of the three clauses $\{x_1, x_2, \bar{x}_3\}$,
$\{\bar{x}_1, \bar{x}_2, \bar{x}_3\}$ and $\{\bar{x}_1, x_2, x_3\}$ is satisfied.

It is straightforward to see that the value at the root of the tree is 1 if all clauses are
satisfied under the quantifiers $Q_1 x_1 Q_2 x_2 \cdots Q_n x_n$ and 0 otherwise. Thus, the circuit solves
the QUANTIFIED SATISFIABILITY problem and its complement. (Note that PSPACE =
coPSPACE, as shown in Theorem 8.6.1.)


[Figure 8.16: a tree circuit of depth three. The root is an AND gate for $\forall x_1$, the second level
holds OR gates for $\exists x_2$, and the third level holds AND gates for $\forall x_3$; left and right branches
are labeled 0 and 1. Leaf values, left to right: 1 0 1 1 0 1 1 0.]

Figure 8.16 A tree circuit constructed from the instance $\forall x_1 \exists x_2 \forall x_3 \, \phi$ for $\phi = (x_1 \vee x_2 \vee
\bar{x}_3) \wedge (\bar{x}_1 \vee \bar{x}_2 \vee \bar{x}_3) \wedge (\bar{x}_1 \vee x_2 \vee x_3)$ of QUANTIFIED SATISFIABILITY. The eight values at
the leaves of the tree are the values of $\phi$ on the eight different assignments to $(x_1, x_2, x_3)$.

    tree_eval(n, φ, Q, d, w):
        if d = n then
            return(evaluate(φ, w));
        else
            if first(Q) = ∃ then
                return(tree_eval(n, φ, rest(Q), d + 1, w0)
                       OR tree_eval(n, φ, rest(Q), d + 1, w1));
            else
                return(tree_eval(n, φ, rest(Q), d + 1, w0)
                       AND tree_eval(n, φ, rest(Q), d + 1, w1));

Figure 8.17 A program for the recursive procedure tree_eval(n, φ, Q, d, w). The tuple w
keeps track of the path taken into the tree.


The circuit $C_{QSAT}$ has size exponential in $n$ because there are $2^n$ values for the $n$ variables.
However, it can be evaluated in polynomial space, as we show. For this purpose consider the
recursive procedure tree_eval(n, φ, Q, d, w) in Fig. 8.17 that evaluates $C_{QSAT}$. Here $n$ is
the number of variables in the quantification, $d$ is the depth of recursion, $\phi$ is the expression
over which quantification is done, $Q$ is a sequence of quantifiers, and $w$ holds the values for
$d$ variables. Also, first(Q) and rest(Q) are the first and all but the first components of
$Q$, respectively. When $d = 0$, $Q = (Q_1, Q_2, \ldots, Q_n)$ and $Q_1 x_1 Q_2 x_2 \cdots Q_n x_n \, \phi$ is the
expression to evaluate. We show that tree_eval(n, φ, Q, 0, ε) can be computed in space
quadratic in the length of an instance of QUANTIFIED SATISFIABILITY.

When $d = n$, the procedure has reached a leaf of the tree and the string $w$ contains
values for the variables $x_1, x_2, \ldots, x_n$, in that order. Since all variables of $\phi$ are known when
$d = n$, $\phi$ can be evaluated. Let evaluate(φ, w) be the function that evaluates $\phi$ with values
specified by $w$. Clearly tree_eval(n, φ, Q, 0, ε) is the value of $Q_1 x_1 Q_2 x_2 \cdots Q_n x_n \, \phi$.

We now determine the work space needed to compute tree_eval(n, φ, Q, d, w) on
a DTM. (The discussion in the proof of Theorem 8.5.5 is relevant.) Evaluation of this
procedure amounts to a depth-first traversal of the tree. An activation record is created for
each call to the procedure and is pushed onto a stack. Since the depth of the tree is $n$, at most
$n + 1$ records will be on the stack. Since each activation record contains a string of length at
most $O(n)$, the total space used is $O(n^2)$. Since the length of $Q_1 x_1 Q_2 x_2 \cdots Q_n x_n \, \phi$ is at
least $n$, the space is polynomial in the length of this formula. ■
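
The recursive procedure of Fig. 8.17 translates almost line for line into a runnable sketch. The encoding is an assumption made for illustration: quantifiers are the characters "A" (universal) and "E" (existential), clauses are lists of nonzero integers ($j$ for $x_j$, $-j$ for $\bar{x}_j$), and the example formula is the three-clause formula of this section as reconstructed from the surrounding discussion.

```python
def evaluate(phi, w):
    """Value of the product-of-sums formula phi on the 0/1 tuple w."""
    return all(any((lit > 0) == bool(w[abs(lit) - 1]) for lit in clause)
               for clause in phi)

def tree_eval(n, phi, Q, d=0, w=()):
    """Depth-first evaluation of the tree circuit; recursion depth is at most n."""
    if d == n:
        return evaluate(phi, w)
    first, rest = Q[0], Q[1:]
    if first == "E":
        # OR gate: either branch may make the subformula true.
        return (tree_eval(n, phi, rest, d + 1, w + (0,))
                or tree_eval(n, phi, rest, d + 1, w + (1,)))
    # AND gate: both branches must make the subformula true.
    return (tree_eval(n, phi, rest, d + 1, w + (0,))
            and tree_eval(n, phi, rest, d + 1, w + (1,)))

# Assumed example: (x1 v x2 v ~x3) ^ (~x1 v ~x2 v ~x3) ^ (~x1 v x2 v x3)
phi = [[1, 2, -3], [-1, -2, -3], [-1, 2, 3]]
print(tree_eval(3, phi, "AEA"))   # quantifier order: for all x1, exists x2, for all x3
print(tree_eval(3, phi, "AEE"))   # third quantifier changed to existential
```

Each active call holds only its own copy of `w`, of length at most $n$, so at most $n + 1$ such tuples exist at once, mirroring the $O(n^2)$ space bound argued above.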


LEMMA 8.12.2 QUANTIFIED SATISFIABILITY is log-space hard for PSPACE.

Proof Our goal is to show that every decision problem $\mathcal{P} \in$ PSPACE can be reduced in
log-space to an instance of QUANTIFIED SATISFIABILITY. Instead, we show that every such
$\mathcal{P}$ can be reduced in log-space to a “No” instance of QUANTIFIED SATISFIABILITY (we call
this QUANTIFIED UNSATISFIABILITY). But a “No” instance is one for which the formula
$\phi$, which is in product-of-sums form, is not satisfied under the specified quantification or
that its Boolean complement, which is in sum-of-products expansion (SOPE) form, is
satisfied under a quantification in which $\forall$ is replaced by $\exists$ and vice versa. Exchanging “Yes”
and “No” instances of decision problems (which we can do since PSPACE is closed
under complements), we have that every problem in coPSPACE can be reduced in log-space
to QUANTIFIED SATISFIABILITY. However, since PSPACE = coPSPACE, we have the
desired result.

Our task now is to show that every problem $\mathcal{P} \in$ PSPACE can be reduced in log-space
to an instance of QUANTIFIED UNSATISFIABILITY. Let $L \in$ PSPACE be the language
of “Yes” instances of $\mathcal{P}$ and let $M$ be the DTM deciding $L$. Instances of QUANTIFIED
UNSATISFIABILITY will be quantified formulas in SOPE form that describe conditions on
the configuration graph $G(M, w)$ of $M$ on input $w$. We associate a Boolean vector with
each vertex in $G(M, w)$ and assume that $G(M, w)$ has one initial and final vertex associated
with the vectors $\mathbf{a}$ and $\mathbf{b}$, respectively. (We can make the last assumption because $M$ can be
designed to enter a cleanup phase in which it prints blanks in all non-blank tape cells.)

Let $\mathbf{c}$ and $\mathbf{d}$ be vector encodings of arbitrary configurations $c$ and $d$ of $G(M, w)$. We
construct formulas $\psi_i(\mathbf{c}, \mathbf{d})$, $0 \leq i \leq k$, in SOPE form that are satisfied if and only if
there exists a path from $c$ to $d$ in $G(M, w)$ of length at most $2^i$ (it computes the
predicate PATH$(c, d, 2^i)$ introduced in the proof of Theorem 8.5.5). Then a “Yes” instance of
QUANTIFIED UNSATISFIABILITY is the formula $\psi_k(\mathbf{a}, \mathbf{b})$, where $\mathbf{a}$ and $\mathbf{b}$ are encodings
of the initial and final vertices of $G(M, w)$ for $k$ sufficiently large that a polynomial-space
computation can be done in time $2^k$. Since, as seen in Theorem 8.5.6, a deterministic
computation in space $S$ is done in time $O(2^S)$, it suffices for $k$ to be polynomial in the length
of the input.

The formula $\psi_0(\mathbf{c}, \mathbf{d})$ is satisfiable if either $c = d$ or $d$ follows from $c$ in one step. Such
formulas are easily computed from the descriptions of $M$ and $w$. $\psi_i(\mathbf{c}, \mathbf{d})$ can be expressed
as shown below, where the existential quantification is over all possible intermediate
configurations $e$ of $M$. (See the proof of Theorem 8.5.5 for the representation of PATH$(c, d, 2^i)$
in terms of PATH$(c, e, 2^{i-1})$ and PATH$(e, d, 2^{i-1})$.)

$$\psi_i(\mathbf{c}, \mathbf{d}) = \exists \mathbf{e} \, [\psi_{i-1}(\mathbf{c}, \mathbf{e}) \wedge \psi_{i-1}(\mathbf{e}, \mathbf{d})] \qquad (8.1)$$

Note that $\exists \mathbf{e}$ is equivalent to $\exists e_1 \exists e_2 \cdots \exists e_q$, where $q$ is the length of $\mathbf{e}$. Universal
quantification over a vector is expanded in a similar fashion.

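Unfortunately, for $i = k$ this recursively defined formula requires space exponential
in the size of the input. Fortunately, we can represent $\psi_i(\mathbf{c}, \mathbf{d})$ more succinctly using the
implication operator $x \Rightarrow y$, as shown below, where $x \Rightarrow y$ is equivalent to $\bar{x} \vee y$. Note
that if $x \Rightarrow y$ is TRUE, then either $x$ is FALSE or $x$ and $y$ are both TRUE.

$$\psi_i(\mathbf{c}, \mathbf{d}) = \exists \mathbf{e} \, \bigl[\forall \mathbf{x} \, \forall \mathbf{y} \, [(\mathbf{x} = \mathbf{c} \wedge \mathbf{y} = \mathbf{e}) \vee (\mathbf{x} = \mathbf{e} \wedge \mathbf{y} = \mathbf{d})] \Rightarrow \psi_{i-1}(\mathbf{x}, \mathbf{y})\bigr] \qquad (8.2)$$

Here $\mathbf{x} = \mathbf{y}$ denotes $(x_1 = y_1) \wedge (x_2 = y_2) \wedge \cdots \wedge (x_q = y_q)$, where $(x_i = y_i)$ denotes
$x_i y_i \vee \bar{x}_i \bar{y}_i$. Then, the formula in the outer square brackets of (8.2) is true when either
$(\mathbf{x} = \mathbf{c} \wedge \mathbf{y} = \mathbf{e}) \vee (\mathbf{x} = \mathbf{e} \wedge \mathbf{y} = \mathbf{d})$ is FALSE or this expression is TRUE and $\psi_{i-1}(\mathbf{x}, \mathbf{y})$ is
also TRUE. Because the contents of the outer square brackets are TRUE, the quantification on
$\mathbf{x}$ and $\mathbf{y}$ requires that $\psi_{i-1}(\mathbf{c}, \mathbf{e})$ and $\psi_{i-1}(\mathbf{e}, \mathbf{d})$ both be TRUE or that the formula given
in (8.1) be satisfied.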
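
The difference in formula length between (8.1) and (8.2) comes down to two recurrences: (8.1) contains two copies of $\psi_{i-1}$ per level, whereas (8.2) contains one. The sketch below iterates both recurrences; the parameters `base` (length of $\psi_0$) and `overhead` (symbols added per level) are hypothetical placeholders, not values taken from the text.

```python
def size_doubling(k, base=1, overhead=1):
    """Length of psi_k under (8.1): size(i) = 2 * size(i - 1) + overhead."""
    size = base
    for _ in range(k):
        size = 2 * size + overhead
    return size

def size_succinct(k, base=1, overhead=1):
    """Length of psi_k under (8.2): size(i) = size(i - 1) + overhead."""
    size = base
    for _ in range(k):
        size = size + overhead
    return size

for k in (5, 10, 20):
    print(k, size_doubling(k), size_succinct(k))
```

With $k$ polynomial in the input length, the first recurrence grows exponentially while the second grows only linearly in $k$ times the per-level overhead.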

It remains to convert the expression for $\psi_i(\mathbf{c}, \mathbf{d})$ given above to SOPE form in log-space.
But this is straightforward. We replace $g \Rightarrow h$ by $\bar{g} \vee h$, where $g = (r \wedge s) \vee (t \wedge u)$ and
$r = (\mathbf{x} = \mathbf{c})$, $s = (\mathbf{y} = \mathbf{e})$, $t = (\mathbf{x} = \mathbf{e})$, and $u = (\mathbf{y} = \mathbf{d})$. It follows that

$$\bar{g} = (\bar{r} \vee \bar{s}) \wedge (\bar{t} \vee \bar{u}) = (\bar{r} \wedge \bar{t}) \vee (\bar{r} \wedge \bar{u}) \vee (\bar{s} \wedge \bar{t}) \vee (\bar{s} \wedge \bar{u})$$

Since each of $r$, $s$, $t$, and $u$ can be expressed as a conjunction of $q$ terms of the form
$(x_j = y_j)$ and $(x_j = y_j) = (x_j y_j \vee \bar{x}_j \bar{y}_j)$, $1 \leq j \leq q$, it follows that $\bar{r}$, $\bar{s}$, $\bar{t}$, and $\bar{u}$
can each be expressed as a disjunction of $2q$ terms. Each of the four terms of the form
$(\bar{r} \wedge \bar{t})$ consists of $4q^2$ terms, each of which is a conjunction of four literals. Thus, $\bar{g}$ is the
disjunction of $16q^2$ terms of four literals each.

Given the regular structure of this formula for $\psi_i$, it can be generated from a formula for
$\psi_{i-1}$ in space $O(\log q)$. Since $0 \leq i \leq k$ and $k$ is polynomial in the length of the input, all
the formulas, including that for $\psi_k$, can be generated in log-space. By the above reasoning,
this formula is a “Yes” instance of QUANTIFIED UNSATISFIABILITY if and only if there is a
path in the configuration graph $G(M, w)$ between the initial and final states. ■

Combining  the  two  results,  we  have  the  following  theorem. 

THEOREM 8.12.1 QUANTIFIED SATISFIABILITY is log-space complete for PSPACE.

8.12.2  Other  PSPACE-Complete  Problems 

An important version of QUANTIFIED SATISFIABILITY is ALTERNATING QUANTIFIED
SATISFIABILITY.

ALTERNATING QUANTIFIED SATISFIABILITY

Instance: Instances of QUANTIFIED SATISFIABILITY that have an even number of
quantifiers that alternate between $\exists$ and $\forall$, with $\exists$ the first quantifier.

Answer: “Yes” if the instance is a “Yes” instance of QUANTIFIED SATISFIABILITY.

THEOREM 8.12.2 ALTERNATING QUANTIFIED SATISFIABILITY is log-space complete for
PSPACE.

Proof ALTERNATING QUANTIFIED SATISFIABILITY is in PSPACE because it is a special
case of QUANTIFIED SATISFIABILITY. We reduce QUANTIFIED SATISFIABILITY to
ALTERNATING QUANTIFIED SATISFIABILITY in log-space as follows. If two universal
quantifiers appear in succession, we add an existential quantifier between them in a new variable,
say $x_l$, and add the new clause $\{x_l, \bar{x}_l\}$ at the end of the formula $\phi$. If two existential
quantifiers appear in succession, add universal quantification over a new variable and a clause
containing it and its negation. If the number of quantifiers is not even, repeat one or the
other of the above steps. This transformation at most doubles the number of variables and
clauses and can be done in log-space. The instance of ALTERNATING QUANTIFIED
SATISFIABILITY is a “Yes” instance if and only if the instance of QUANTIFIED SATISFIABILITY is
a “Yes” instance. ■
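
The padding step of this proof can be sketched as follows. This is an illustrative reading of the construction, not the log-space transducer itself: quantifiers are encoded as the characters "E" and "A", clauses as lists of nonzero integers, variables are renumbered so that dummy variables sit between the originals, and a leading universal quantifier is handled by prepending a dummy existential one (an assumption, since the proof leaves that case implicit). All names are hypothetical.

```python
def make_alternating(quantifiers, clauses):
    """Return (quantifiers', clauses') with pattern E, A, E, A, ... of even length."""
    out_q, extra, rename = [], [], {}

    def wanted():
        return "E" if len(out_q) % 2 == 0 else "A"

    def pad():
        out_q.append(wanted())
        v = len(out_q)
        extra.append([v, -v])      # clause {x_v, not x_v}, always satisfied

    for j, q in enumerate(quantifiers, start=1):
        if q != wanted():
            pad()                  # insert a dummy variable to restore alternation
        out_q.append(q)
        rename[j] = len(out_q)     # new index of original variable x_j
    if len(out_q) % 2 == 1:
        pad()                      # make the number of quantifiers even
    new_clauses = [[(1 if lit > 0 else -1) * rename[abs(lit)] for lit in c]
                   for c in clauses]
    return "".join(out_q), new_clauses + extra

def qbf_true(quantifiers, clauses, w=()):
    """Brute-force evaluation of a quantified formula (tiny inputs only)."""
    if len(w) == len(quantifiers):
        return all(any((lit > 0) == bool(w[abs(lit) - 1]) for lit in c)
                   for c in clauses)
    branches = (qbf_true(quantifiers, clauses, w + (b,)) for b in (0, 1))
    return any(branches) if quantifiers[len(w)] == "E" else all(branches)
```

Because every added clause contains a variable and its negation, the dummy variables never affect the truth value, which is why the padded instance is a “Yes” instance exactly when the original is.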

The  new  version  of  QUANTIFIED  SATISFIABILITY  is  akin  to  a  game  in  which  universal 
and  existential  players  alternate.  The  universal  player  attempts  to  show  a  fact  for  all  values  of 
its  Boolean  variable,  whereas  the  existential  player  attempts  to  deny  that  fact  by  the  choice  of 
its  existential  variable.  It  is  not  surprising,  therefore,  that  many  games  are  PSPACE-complete. 
The  geography  game  described  below  is  of  this  type. 

The  geography  game  is  a  game  for  two  players.  They  alternate  choosing  names  of  cities 
in  which  the  first  letter  of  the  next  city  is  the  last  letter  of  the  previous  city  until  one  of  the  two 
players  (the  losing  player)  cannot  find  a  name  that  has  not  already  been  used.  (See  Fig.  8.18.) 
This game is modeled by a graph in which each vertex carries the name of a city and there is
an edge from vertex $u_1$ to vertex $u_2$ if the last letter in the name associated with $u_1$ is the first
letter in the name associated with $u_2$. In general this graph is directed because an edge from
$u_1$ to $u_2$ does not guarantee an edge from $u_2$ to $u_1$.

GENERALIZED  GEOGRAPHY 

Instance:  A  directed  graph  G  =  (V,  E )  and  a  vertex  v. 

Answer:  “Yes”  if  there  is  a  sequence  of  (at  most  |V|)  alternating  vertex  selections  by  two 
players  such  that  vertex  v  is  the  first  selection  by  the  first  player  and  for  each  selection  of 
the  first  player  and  all  selections  of  the  second  player  of  vertices  adjacent  to  the  previous 
selection,  the  second  player  arrives  at  a  vertex  from  which  it  cannot  select  a  vertex  not 
previously  selected. 

THEOREM 8.12.3 GENERALIZED GEOGRAPHY is log-space complete for PSPACE.

Proof  To  show  that  GENERALIZED  GEOGRAPHY  is  log-space  complete  for  PSPACE,  we 
show  that  it  is  in  PSPACE  and  that  QUANTIFIED  SATISFIABILITY  can  be  reduced  to  it 
in  log-space.  To  establish  the  first  result,  we  show  that  the  outcome  of  GENERALIZED 
GEOGRAPHY  can  be  determined  by  evaluating  a  graph  similar  to  the  binary  tree  used  to 
show  that  QUANTIFIED  SATISFIABILITY  is  realizable  in  PSPACE. 

Given the graph $G = (V, E)$ (see Fig. 8.18(a)), we construct a search graph (see
Fig. 8.18(b)) by performing a variant of depth-first search of $G$ from $v$. At each vertex
we visit the next unvisited descendant, continuing until we encounter a vertex on the
current path, at which point we backtrack and try the next sibling of the current vertex, if any.
In depth-first search if a vertex has been visited previously, it is not visited again. In this
variant of the algorithm, however, a vertex is revisited if it is not on the current path. The
length of the longest path in this tree is at most $|V| - 1$ because each path can contain no
more than $|V|$ vertices. The tree may have a number of vertices exponential in $|V|$.

At a leaf vertex a player has no further moves. The first player wins if it is the second
player’s turn at a leaf vertex and loses otherwise. Thus, a leaf vertex is labeled 1 (0) if the
first player wins (loses). To ensure that the value at a vertex $u$ is 1 if the two players reach $u$
and the first player wins, we assign OR operators to vertices at which the first player makes
selections and AND operators otherwise. (The output of a one-input AND or OR gate is the
value of its input.) This provides a circuit that can be evaluated just as was the circuit $C_{QSAT}$
used in the proof of Theorem 8.12.1. The “Yes” instances of GENERALIZED GEOGRAPHY
are such that the first player can win by choosing a first city. In Fig. 8.18 the value of the
root vertex is 0, which means that the first player loses by choosing to start with Marblehead
as the first city.

Figure 8.18 (a) A graph for the generalized geography game and (b) the search tree associated
with the game in which the start vertex is Marblehead.

Vertices labeled AND or OR in the tree generated by depth-first search can have arbitrary
in-degree because the number of vertices that can be reached from a vertex in the original
graph is not restricted. The procedure tree_eval described in the proof of Theorem 8.12.1
can be modified to apply to the evaluation of this DAG whose vertex in-degree is potentially
unbounded. (See Problem 8.30.) This modified procedure runs in space polynomial in the
size of the graph $G$.
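
A minimal sketch of such an evaluation is given below. It is one possible reading of the modified procedure, not the solution to Problem 8.30: the graph is assumed to be a dictionary mapping each vertex to a list of successors, the first player is assumed to have just selected the start vertex, and the function names are hypothetical.

```python
def first_player_wins(graph, start):
    """True iff the first player, having chosen `start`, can force a win."""

    def mover_wins(current, used):
        # The player to move wins if some unused successor leaves the
        # opponent in a losing position; with no legal move, any() is False.
        return any(not mover_wins(nxt, used | {nxt})
                   for nxt in graph.get(current, ())
                   if nxt not in used)

    # After the first player selects `start`, the second player moves next.
    return not mover_wins(start, frozenset([start]))
```

The recursion depth is at most $|V|$ and each call stores a set of at most $|V|$ vertices, so the space used is polynomial in the size of $G$ even though the search tree may be exponentially large.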

We  now  show  that  ALTERNATING  QUANTIFIED  SATISFIABILITY  (abbreviated  AQSAT) 
can  be  reduced  in  log-space  to  GENERALIZED  GEOGRAPHY.  Given  an  instance  of  AQSAT 
such  as  that  shown  below,  we  construct  an  instance  of  GENERALIZED  GEOGRAPHY,  as 
shown  in  Fig.  8.19.  We  assume  without  loss  of  generality  that  the  number  of  quantifiers  is 
even.  If  not,  add  a  dummy  variable  and  quantify  on  it: 

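$$\exists x_1 \forall x_2 \exists x_3 \forall x_4 [(x_1 \vee x_2 \vee \bar{x}_3) \wedge (x_1 \vee \bar{x}_2 \vee \bar{x}_3) \wedge (x_1 \vee x_2 \vee x_3) \wedge (x_4 \vee \bar{x}_4)]$$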


Figure 8.19 An instance of GENERALIZED GEOGRAPHY corresponding to an instance of
ALTERNATING QUANTIFIED SATISFIABILITY.

The instance of GENERALIZED GEOGRAPHY corresponding to an instance of AQSAT
is formed by cascading a set of diamond-shaped subgraphs, one per variable (see Fig. 8.19),
and connecting the bottom vertex b in the last diamond to a set of vertices, one per clause.
An edge is drawn from a clause to a vertex associated with a literal ($x_i$ or $\bar{x}_i$) if that literal
is in the clause. The literal $x_i$ ($\bar{x}_i$) is associated with the middle vertex on the right-hand
(left-hand) side of a diamond. Thus, in the example, there is an edge from the leftmost
clause vertex to the left-hand vertex in the diamond for $x_3$ and to the right-hand vertices in
diamonds for $x_1$ and $x_2$.

Let  the  geography  game  be  played  on  this  graph  starting  with  the  first  player  from  the 
topmost vertex labeled t. The first player can choose either the left or right path. The second
player  has  only  one  choice,  taking  it  to  the  bottom  of  the  first  diamond,  and  the  first  player 
now  has  only  one  choice,  taking  it  to  the  top  of  the  second  diamond.  The  second  player 
now  can  choose  a  path  to  follow.  Continuing  in  this  fashion,  we  see  that  the  first  (second) 
player  can  exercise  a  choice  on  the  odd-  (even-)  numbered  diamonds  counting  from  the  top. 
Since  the  number  of  quantifiers  is  even,  the  choice  at  the  bottom  vertex  labeled  b  belongs  to 
the  second  player.  Observe  that  whatever  choices  are  made  within  the  diamonds,  the  vertices 
labeled  m  and  b  are  visited. 

Because  the  goal  of  each  player  is  to  force  the  other  player  into  a  position  from  which 
it  has  no  moves,  at  vertex  b  the  second  player  attempts  to  choose  a  clause  vertex  such  that 
the  first  player  has  no  moves:  that  is,  every  vertex  reachable  from  the  clause  vertex  chosen  by 
the  second  player  has  already  been  visited.  On  the  other  hand,  if  all  clauses  are  satisfiable, 
then  for  every  clause  chosen  by  the  second  player  there  should  be  an  edge  from  its  vertex  to 
a diamond vertex that has not been previously visited. To ensure that the first player wins if
and only if the instance of AQSAT used to construct this graph is a "Yes" instance, the first
player always chooses an edge according to the directions in Fig. 8.19. For example, it visits
the vertex labeled $\bar{x}_1$ if it wishes to set $x_1 = 1$ because this means that the vertex labeled $x_1$
is not visited on the path from t to b and can be visited by the first player on the last step of
the game. Since each vertex labeled m and b is visited before a clause vertex is visited, the
second player does not have a move and loses. ■

8.13  The  Circuit  Model  of  Computation 

The  complexity  classes  seen  so  far  in  this  chapter  are  defined  in  terms  of  the  space  and 
time  needed  to  recognize  languages  with  deterministic  and  nondeterministic  Turing  machines. 
These  classes  generally  help  us  to  understand  the  complexity  of  serial  computation.  Circuit 
complexity  classes,  studied  in  this  section,  help  us  to  understand  parallel  computation. 

Since  a  circuit  is  a  fixed  interconnection  of  gates,  each  circuit  computes  a  single  Boolean 
function  on  a  fixed  number  of  inputs.  Thus,  to  compute  the  unbounded  set  of  functions 
computed  by  a  Turing  machine,  a  family  of  circuits  is  needed.  In  this  section  we  investigate 
uniform and non-uniform circuit families. A uniform family of circuits is a potentially
unbounded set of circuits for which there is a Turing machine that, given an integer n in unary
notation,  writes  a  description  of  the  nth  circuit.  We  show  that  uniform  circuits  compute  the 
same  functions  as  Turing  machines. 

As mentioned below, non-uniform families of circuits are so powerful that they can
compute functions not computed by Turing machines. Given the Church-Turing thesis, it doesn't
make sense to assume non-uniform circuits as a model of computation. On the other hand, if
we can develop large lower bounds on the size or depth of circuits without regard to whether or
not they are drawn from a uniform family, then such lower bounds apply to uniform families
as well and, in particular, to other models of computation, such as Turing machines. For this
reason non-uniform circuits are important.

A  circuit  is  a  form  of  unstructured  parallel  machine,  since  its  gates  can  operate  in  parallel. 
The parallel random-access machine (PRAM) introduced in Chapter 1 and examined in
Chapter 7 is another important parallel model of computation in terms of which the performance
of  many  other  parallel  computational  models  can  be  measured.  In  Section  8.14  we  show  that 
circuit  size  and  depth  are  related  to  number  of  processors  and  time  on  the  PRAM.  These  results 
emphasize  the  important  role  of  circuits  not  only  in  the  construction  of  machines,  but  also  in 
measuring  the  serial  and  parallel  complexity  of  computational  problems. 

Throughout the following sections we assume that circuits are constructed from gates
chosen from the standard basis $\Omega_0$ = {AND, OR, NOT}.

We  now  explore  uniform  and  non-uniform  circuit  families,  thereby  setting  the  stage  for 
the next chapter, in which methods for deriving lower bounds on the size of circuits are
developed. After introducing uniform circuits we show that uniform families of circuits and Turing
machines  compute  the  same  functions.  We  then  introduce  a  number  of  languages  defined  in 
terms  of  the  properties  of  families  of  circuits  that  recognize  them. 

8.13.1  Uniform  Families  of  Circuits 

Families  of  circuits  are  useful  in  characterizing  decision  problems  in  which  the  set  of  instances 
is  unbounded.  One  circuit  in  each  family  is  associated  with  the  “Yes”  instances  of  each  length: 
it  has  value  1  on  the  “Yes”  instances  and  value  0  otherwise. 

Families  of  circuits  are  designed  in  Chapter  3  to  simulate  computations  by  finite-state, 
random-access,  and  Turing  machines  on  arbitrary  numbers  of  inputs.  For  each  machine  M 
of  one  of  these  types,  there  is  a  DTM  S(M)  such  that  on  an  input  of  length  n,  S(M)  can 
produce  as  output  the  description  of  a  circuit  on  n  inputs  that  computes  exactly  the  same 
function  as  does  M  on  n  inputs.  (See  the  program  in  Fig.  3.27.)  These  circuits  are  generated 
in  a  uniform  fashion. 

On the other hand, non-uniform circuit families can be used to define non-computable
languages. For example, consider the family in which the nth circuit, $C_n$, is designed to have
value 1 on those strings w of length n in the language $\mathcal{L}_1$ defined in Section 5.7 and value 0
otherwise. Such a circuit realizes the minterm defined by w. As shown in Theorem 5.7.4, $\mathcal{L}_1$
is not recursively enumerable; that is, there is no Turing machine that can recognize it.

This  example  motivates  the  need  to  identify  families  of  circuits  that  compute  functions 
computable  by  Turing  machines,  that  is,  uniform  families  of  circuits. 

DEFINITION 8.13.1 A circuit family $\mathcal{C} = \{C_1, C_2, C_3, \ldots\}$ is a collection of logic circuits in
which $C_n$ has n inputs and m(n) outputs for some function $m : \mathbb{N} \mapsto \mathbb{N}$.

A time-r(n) (space-r(n)) uniform circuit family is a circuit family for which there is a
deterministic Turing machine M such that for each integer n supplied in unary notation, namely
$1^n$, on its input tape, M writes the description of $C_n$ on its output tape using time (space) r(n).

A log-space uniform circuit family is one for which the temporary storage space used by a
Turing machine that generates it is $O(\log n)$, where n is the length of the input. The function
$f : \mathcal{B}^* \mapsto \mathcal{B}^*$ is computed by $\mathcal{C}$ if for each $n \geq 1$, f restricted to n inputs is the function
computed by $C_n$.
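
As an illustration of the definition, the following Python sketch plays the role of the generating machine M for one hypothetical family, in which the nth circuit computes the parity of its n inputs over the basis {AND, OR, NOT}. The gate-list format and the name describe_parity_circuit are assumptions made for this example; they are not the circuit description format used elsewhere in the book. The generator keeps only a loop counter and the name of the most recent gate, which is the kind of bookkeeping a log-space generator is allowed.

```python
def describe_parity_circuit(n):
    """Write a description of C_n, a circuit for x1 xor x2 xor ... xor xn.

    Each gate is a tuple (name, operation, argument names).  The exclusive
    OR of a and b is built as (a OR b) AND NOT (a AND b).
    Returns the gate list and the name of the output gate.
    """
    gates = [("x%d" % i, "INPUT", ()) for i in range(1, n + 1)]
    acc = "x1"                                   # running parity so far
    for i in range(2, n + 1):
        x = "x%d" % i
        gates.append(("or%d" % i, "OR", (acc, x)))
        gates.append(("and%d" % i, "AND", (acc, x)))
        gates.append(("not%d" % i, "NOT", ("and%d" % i,)))
        gates.append(("xor%d" % i, "AND", ("or%d" % i, "not%d" % i)))
        acc = "xor%d" % i
    return gates, acc


gates, out = describe_parity_circuit(3)
for gate in gates:
    print(gate)
print("output gate:", out)
```

Because one fixed program produces the description of every member of the family, the family is uniform in the sense of the definition.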

8.13.2 Uniform Circuits Are Equivalent to Turing Machines

We now show that the functions computed by log-space uniform families of circuits and by
polynomial-time DTMs are the same. Since the families of functions computed by one-tape
and multi-tape Turing machines are the same (see Theorem 5.2.1), we prove the result only
for the standard one-tape Turing machine and proper resource functions (see Section 8.3).

THEOREM 8.13.1 Let p(n) be a polynomial and a proper function. Then every total function
$f : \mathcal{B}^* \mapsto \mathcal{B}^*$ computed by a DTM in time p(n) on inputs of length n can be computed by a
log-space uniform circuit family $\mathcal{C}$.

Proof Let $f_n : \mathcal{B}^n \mapsto \mathcal{B}^*$ be the restriction to inputs of length n of the function $f : \mathcal{B}^* \mapsto \mathcal{B}^*$
computed by a DTM M in time p(n). It follows that the number of bits in the word
$f_n(w)$ is at most p(n). Since the function computed by a circuit has a fixed-length output
and the length of $f_n(w)$ may vary for different inputs w of length n, we show how to create
a DTM $M^*$, a modified version of M, that computes $f_n^*$, a function that contains all the
information in the function $f_n$. The value of $f_n^*$ has at most 2p(n) bits on inputs of length
n. We show that $M^*$ produces its output in time $O(p^2(n))$.

Let $M^*$ place a mark in the 2p(n)th cell on its tape (a cell beyond any reached during
a computation). Let it now simulate M, which is assumed to print its output in the first
k locations on the tape, $k \leq p(n)$. $M^*$ now recodes and expands this binary string into a
longer string. It does so by marking k cells to the right of the output string (in at most $k^2$ steps),
after which it writes every letter in the output string twice. That is, 0 appears as 00 and 1
as 11. Finally, the remaining $2(p(n) - k)$ cells are filled with alternating 0s and 1s. Clearly,
the value of $f_n$ can be readily deduced from the output, but the length of the value of $f_n^*$ is the
same on all inputs of length n.
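
The recoding can be checked on small strings with the following Python sketch. It is only an illustration of the encoding, not of the Turing machine that produces it; the function names pad_output and unpad_output are hypothetical, and strings of the characters 0 and 1 stand in for tape contents.

```python
def pad_output(y, p):
    """Encode the output string y (len(y) <= p) as a string of exactly 2*p bits.

    Every bit of y is written twice (0 -> 00, 1 -> 11) and the remaining
    2*(p - len(y)) cells are filled with alternating 0s and 1s.
    """
    if len(y) > p:
        raise ValueError("output longer than the bound p")
    doubled = "".join(bit * 2 for bit in y)
    filler = "01" * (p - len(y))
    return doubled + filler


def unpad_output(z):
    """Recover y: read pairs until the first pair whose two bits differ."""
    y = []
    for i in range(0, len(z), 2):
        pair = z[i:i + 2]
        if pair[0] != pair[1]:       # a 01 pair belongs to the filler
            break
        y.append(pair[0])
    return "".join(y)


encoded = pad_output("101", 5)
print(encoded, len(encoded), unpad_output(encoded))
```

Every encoded string has length 2p regardless of the length of y, and a doubled pair can never be confused with a filler pair, so decoding is unambiguous.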

A Turing machine $M_{\mathcal{C}}$ that constructs the nth circuit from n represented in unary and a
description of $M^*$ invokes a slightly revised version of the program of Fig. 3.27 to construct
the circuit computing $f_n$. This revised circuit contains placeholders for the values of the
n letters representing the input to M. The program uses space $O(\log p^2(n))$, which is
logarithmic in n. ■

We  now  show  that  the  function  computed  by  a  log-space  uniform  family  of  circuits  can  be 
computed  by  a  polynomial-time  Turing  machine. 

THEOREM 8.13.2 Let $\mathcal{C}$ be a log-space uniform circuit family. Then there exists a polynomial-time
Turing machine M that computes the same set of functions computed by the circuits in $\mathcal{C}$.

Proof Let $M_{\mathcal{C}}$ be the log-space TM that computes the circuit family $\mathcal{C}$. We design the TM
M to compute the same set of functions on an input w of length n. M uses w to obtain a
unary representation of n as the input to $M_{\mathcal{C}}$. It uses $M_{\mathcal{C}}$ to write down a description of the nth
circuit on its work tape. It then computes the outputs of this circuit in time quadratic in the
length of the circuit. Since the length of the circuit is a polynomial in n because the circuit
is generated by a log-space TM (see Theorem 8.5.8), the running time of M is polynomial
in the length of w. ■
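
The three steps carried out by M (obtain n, generate the description of the nth circuit, evaluate it) can be mimicked as follows. This is a sketch under stated assumptions: describe_or_circuit is a hypothetical stand-in for the generating machine, the gate-list format is invented for the example, and the evaluator uses a dictionary, so it runs in time linear in the length of the description rather than the quadratic time of the one-tape machine in the proof.

```python
def describe_or_circuit(n):
    """Hypothetical generator: C_n is the OR of its n inputs."""
    gates = [("x%d" % i, "INPUT", ()) for i in range(1, n + 1)]
    prev = "x1"
    for i in range(2, n + 1):
        gates.append(("g%d" % i, "OR", (prev, "x%d" % i)))
        prev = "g%d" % i
    return gates, prev


def evaluate(gates, output, w):
    """Evaluate a gate list in the order written, on the bit string w."""
    value = {}
    bits = iter(w)
    for name, op, args in gates:
        if op == "INPUT":
            value[name] = int(next(bits))
        elif op == "NOT":
            value[name] = 1 - value[args[0]]
        elif op == "AND":
            value[name] = value[args[0]] & value[args[1]]
        elif op == "OR":
            value[name] = value[args[0]] | value[args[1]]
        else:
            raise ValueError("unknown gate type: %s" % op)
    return value[output]


def run_m(w):
    n = len(w)                                   # step 1: obtain n from w
    gates, out = describe_or_circuit(n)          # step 2: write down C_n
    return evaluate(gates, out, w)               # step 3: evaluate C_n on w


print(run_m("0010"), run_m("0000"))
```

Since each gate appears in the list after its arguments, a single left-to-right pass over the description suffices.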

These  two  results  can  be  generalized  to  uniform  circuit  families  and  Turing  machines  that 
use  more  than  logarithmic  space  and  polynomial  time,  respectively.  (See  Problem  8.32.) 

In the above discussion we examine functions computed by Turing machines. If these
functions are characteristic functions, $f : \mathcal{B}^* \mapsto \mathcal{B}$; that is, they have value 0 or 1, then
those strings for which f has value 1 define a language $L_f$. Also, associated with each language
$L \subseteq \mathcal{B}^*$ is a characteristic function $f_L : \mathcal{B}^* \mapsto \mathcal{B}$ that has value 1 on only those strings in L.

Consider now a language $L \subseteq \mathcal{B}^*$. For each $n \geq 1$ a circuit can be constructed whose
value is 1 on binary strings in $L \cap \mathcal{B}^n$ and 0 otherwise. Similarly, given a family $\mathcal{C}$ of circuits
such that for each natural number $n \geq 1$ the nth circuit, $C_n$, computes a Boolean function
on n inputs, the language L associated with this circuit family contains only those strings of
length n for which $C_n$ has value 1. We say that L is recognized by $\mathcal{C}$. At the risk of confusion,
we use the same name for circuit families and the languages they define.

In Theorem 8.5.6 we show that NSPACE(r(n)) $\subseteq$ TIME($k^{\log n + r(n)}$). We now use
the ideas of that proof together with the parallel algorithm for transitive closure given in
Section 6.4 to show that languages in NSPACE(r(n)), $r(n) \geq \log n$, are recognized by a uniform
family of circuits in which the nth circuit has size $O(k^{\log n + r(n)})$ and depth $O(r^2(n))$. When
$r(n) = O(\log n)$, the circuit family in question is contained in the class $\text{NC}^2$ introduced in
the next section.

THEOREM 8.13.3 If language $L \subseteq \mathcal{B}^*$ is in NSPACE(r(n)), $r(n) \geq \log n$, there exists a
time-r(n) uniform family of circuits recognizing L such that the nth circuit has size $O(k^{\log n + r(n)})$
and depth $O(r^2(n))$ for some constant k.

Proof We assume without loss of generality that the NDTM M accepting L has one accepting
configuration. We then construct the adjacency matrix for the configuration graph of M.
This matrix has a 1 entry in row i, column j if there is a transition from the ith to the
jth configuration. All other entries are 0. From the analysis of Corollary 8.5.1, this graph
has $O(k^{\log n + r(n)})$ configurations. The initial configuration is determined by the word w
written initially on the tape of the NDTM accepting L. If the transitive closure of this
matrix has a 1 in the row and column corresponding to the initial and final configurations,
respectively, then the word w is accepted.

From Theorem 6.4.1 the transitive closure of a Boolean $p \times p$ matrix A can be computed
by computing $(I + A)^q$ for $q \geq p - 1$. This can be done by squaring $I + A$ s times for
$s \geq \log_2 p$. From this we conclude that the transitive closure can be computed by a circuit
of depth $O(\log^2 m)$, where m is the number of configurations. Since $m = O(k^{\log n + r(n)})$,
we have the desired circuit size and depth bounds.
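
The repeated-squaring step can be tried on small matrices with the Python sketch below. It is a serial illustration under stated assumptions: matrices are lists of 0/1 rows, the names bool_mult and transitive_closure are hypothetical, and the result is the reflexive and transitive closure because the identity is added before squaring.

```python
from math import ceil, log2


def bool_mult(X, Y):
    """Product of two p x p Boolean matrices (OR of ANDs)."""
    p = len(X)
    return [[int(any(X[i][k] and Y[k][j] for k in range(p)))
             for j in range(p)]
            for i in range(p)]


def transitive_closure(A):
    """Compute (I + A)^(2^s) with s = ceil(log2 p) squarings."""
    p = len(A)
    M = [[int(A[i][j] or i == j) for j in range(p)] for i in range(p)]  # I + A
    s = ceil(log2(p)) if p > 1 else 0
    for _ in range(s):
        M = bool_mult(M, M)
    return M


# A directed path 0 -> 1 -> 2 -> 3.
A = [[0, 1, 0, 0],
     [0, 0, 1, 0],
     [0, 0, 0, 1],
     [0, 0, 0, 0]]
for row in transitive_closure(A):
    print(row)
```

After s squarings the matrix records all paths of length at most $2^s \geq p - 1$, which is every simple path in a graph on p vertices. In a circuit each squaring has depth logarithmic in p, which is the source of the $O(\log^2 m)$ depth bound.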

A program to compute the $2^d$th power of a $p \times p$ matrix A is shown in Fig. 8.20. This
program can be converted to one that writes the description of a circuit for this purpose,
and both the original and converted programs can be realized in space $O(d \log p)$. (See
Problem 8.33.) Invoking this procedure to write a program for the above problem, we see
that an $O(r^2(n))$-depth circuit recognizing L can be written by an $O(r^2(n))$-time DTM. ■

trans(A, n, d, i, j)
    if d = 0 then
        return($a_{i,j}$)
    else
        return($\sum_{k=1}^{n}$ trans(A, n, d − 1, i, k) * trans(A, n, d − 1, k, j))

Figure 8.20 A recursive program to compute the $2^d$th power of an $n \times n$ matrix A.
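
A direct Python rendering of the recursion in Fig. 8.20 is given below as a sketch. It assumes that the sum and product in the figure are Boolean OR and AND, and it uses 0-based indices; both choices are assumptions of this example. Each level of recursion stores only i, j, and the loop index k, so the storage in use at any moment is proportional to $d \log n$ bits, although the running time grows rapidly with d.

```python
def trans(A, n, d, i, j):
    """Entry (i, j) of A raised to the power 2**d over the Boolean semiring."""
    if d == 0:
        return A[i][j]
    # OR over all intermediate vertices k of the AND of two half-length paths.
    return int(any(trans(A, n, d - 1, i, k) and trans(A, n, d - 1, k, j)
                   for k in range(n)))


# I + A for the directed path 0 -> 1 -> 2 -> 3.
B = [[1, 1, 0, 0],
     [0, 1, 1, 0],
     [0, 0, 1, 1],
     [0, 0, 0, 1]]
print(trans(B, 4, 1, 0, 2), trans(B, 4, 1, 0, 3), trans(B, 4, 2, 0, 3))
```

With d = 1 the procedure sees paths of length at most 2, so vertex 3 is not yet reachable from vertex 0; with d = 2 it is.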


8.14  The  Parallel  Random-Access  Machine  Model 

The  PRAM  model,  introduced  in  Section  7.9,  is  an  abstraction  of  realistic  parallel  models  that 
is  sufficiently  rich  to  permit  the  study  of  parallel  complexity  classes.  (See  Fig.  7.21,  repeated  as 
Fig.  8.21.)  The  PRAM  consists  of  a  set  of  RAM  processors  with  a  bounded  number  of  memory 
locations  and  a  common  memory.  The  words  of  the  common  memory  are  allowed  to  be  of 
unlimited  size,  but  the  instructions  that  the  RAM  processors  can  apply  to  them  are  restricted. 
These  processors  can  perform  addition,  subtraction,  vector  comparison  operations,  conditional 
branching,  and  shifts  by  fixed  amounts.  We  also  allow  load  and  store  instructions  for  moving 
words  between  registers,  local  memories,  and  the  common  memory.  These  instructions  are 
sufficiently  rich  to  compute  all  computable  functions. 

In  the  next  section  we  show  that  the  CREW  (concurrent  read/exclusive  write)  PRAM  that 
runs  in  polynomial  time  and  the  log-space  uniform  circuits  characterize  the  same  complexity 
classes.  We  then  go  on  to  explore  the  parallel  complexity  thesis,  which  states  that  sequential 
space  and  parallel  time  are  polynomially  related. 

8.14.1  Equivalence  of  the  CREW  PRAM  and  Circuits 

Because  a  parallel  machine  with  p  processors  can  provide  a  speedup  of  at  most  a  factor  of  p  over 
a comparable serial machine (see Theorem 7.4.1), problems that are computationally
infeasible on serial machines are computationally infeasible on parallel machines with a reasonable
number  of  processors.  For  this  reason  the  study  of  parallelism  is  usually  limited  to  feasible 
problems,  that  is,  problems  that  can  be  solved  in  serial  polynomial  time  (the  class  P).  We  limit 
our  attention  to  such  problems  here. 


Figure 8.21 The PRAM consists of synchronous RAMs accessing a common memory.

Connections between PRAMs and circuits can be derived that are similar to those stated
for  Turing  machines  and  circuits  in  Section  8.13.2.  In  this  section  we  consider  only  log-space 
uniform  families  of  circuits. 

Given  a  PRAM,  we  now  construct  a  circuit  simulating  it.  This  construction  is  based 
on  that  given  in  Section  3.4.  With  a  suitable  definition  of  log-space  uniform  family  of 
PRAMs  the  circuits  described  in  the  following  lemma  constitute  a  log-space  uniform  family 
of circuits. (See Problem 8.35.) Also, this lemma can be extended to PRAMs that access
memory locations with addresses much larger than $O(p(n)t(n))$, perhaps through indirect
addressing. (See Problem 8.37.)

LEMMA 8.14.1 Consider a function on input words of total length n bits computed by a CREW
PRAM P in time t(n) with a polynomial number of processors p(n) in which the largest common
memory address is $O(p(n)t(n))$. This function can be computed by a circuit of size
$O(p^2(n)t(n) + p(n)t^2(n))$ and depth $O(\log(p(n)t(n)))$.

Proof Since P executes at most t(n) steps, by a simple extension to Problem 8.4 (only one
RAM CPU at a time writes a word), we know that after t(n) steps each word in the common
memory of the PRAM has length at most b = t(n) + n + K for some constant K > 0,
because the PRAM can only compare or add numbers or shift them left by one position on
each time step. This follows because each RAM CPU uses integers of fixed length and the
length of the longest word in the common memory is initially n.

We exhibit a circuit for the computation by P by modifying and extending the circuit
sketched in Section 3.4 to simulate one RAM CPU. This circuit uses the next-state/output
circuit for the RAM CPU together with the next-state/output circuit for the random-access
memory of Fig. 3.21 (repeated in Fig. 8.22). The circuit of Fig. 8.22(a) either writes a new
value $d_j$ for $w^*_{i,j}$, the jth component of the ith memory word of the random-access memory,
or it writes the old value $w_{i,j}$. The circuit simulating the common memory of the PRAM
is obtained by replacing the three gates at the output of the circuit in Fig. 8.22(a) with a
subcircuit that assigns to $w^*_{i,j}$ the value of $w_{i,j}$ if $c_i = 0$ for each RAM CPU and the OR of
the values of $d_j$ supplied by each RAM CPU if $c_i = 1$ for some CPU. Here we count on the
fact that at most one CPU addresses a given location for writing. Thus, if a CPU writes to
a location, all other CPUs cannot do so. Concurrent reading is simulated by allowing every
component of every memory cell to be used as input by every CPU.

Since the longest word that can be constructed by the CREW PRAM has length b =
t(n) + n + K, it follows from Lemma 3.5.1 that the next-state/output circuit for the
random-access memory designed for one CPU has size $O(p(n)t^2(n))$ and depth $O(\log(p(n)t(n)))$.
The modifications described in the previous paragraph add size $O(p^2(n)t(n))$ (each of the
p(n)t(n) memory words has $O(p(n))$ new gates) and depth $O(\log p(n))$ (each OR tree
has p(n) inputs) to this circuit. As shown at the end of Section 3.10, the size and depth
of a circuit for the next-state/output circuit of the CPU are $O(t(n) + \log(p(n)t(n)))$ and
$O(\log t(n) + \log\log(p(n)t(n)))$, respectively. Since these sizes and depths add to those
for the common memory, the total size and depth for the next-state/output circuit for the
PRAM are $O(p^2(n)t(n) + p(n)t^2(n))$ and $O(\log(p(n)t(n)))$, respectively. ■

We  now  show  that  the  function  computed  by  a  log-space  uniform  circuit  family  can  be 
computed  in  poly-logarithmic  time  on  a  PRAM. 

Figure 8.22 A circuit for the next-state and output function of the random-access memory.
The circuit in (a) computes the next values for components of memory words, whereas that in (b)
computes components of the output word. This circuit is modified to generate a circuit for the
PRAM.


LEMMA 8.14.2 Let $\mathcal{C} = \{C_1, C_2, \ldots\}$ be a log-space uniform family of circuits. There exists a
CREW PRAM that computes in poly-logarithmic time and with a polynomial number of processors the
function $f : \mathcal{B}^* \mapsto \mathcal{B}^*$ computed by $\mathcal{C}$.

Proof The CREW PRAM is given a string w on which to compute the function f. First
it computes the length n of w. Second it invokes the CREW PRAM described below to
simulate with a polynomial number of processors in poly-logarithmic time the log-space
DTM M that writes a description of the nth circuit, C(M, n). Finally we show that the
value of C(M, n) can be evaluated from this description by a CREW PRAM in $O(\log^2 n)$
steps with polynomially many processors.

Let M be a three-tape DTM that realizes a log-space transformation. This DTM has
a read-only input tape, a work tape, and a write-only output tape. Given a string w on its
input tape, it provides on its output tape the result of the transformation. Since M uses
$O(\log n)$ cells on its work tape on inputs of length n, it can be modeled by a finite-state
machine with $2^{O(\log n)}$ states. The circuit C(M, n) described in Theorem 3.2.2 for the
simulation of the FSM M is constructed to simulate M on inputs of length n. We show
that C(M, n) has size and depth that are polynomial and poly-logarithmic in n, respectively.
We then demonstrate that a CREW PRAM can simulate C(M, n) (and write its output into
its common memory) in $O(\log^2 n)$ steps with a polynomial number of processors.

From Theorem 8.5.8 we know that the log-space DTM M generating C(M, n) does
not execute more than p(n) steps, p(n) a polynomial in n. Since p(n) is assumed proper,
we can assume without loss of generality that M executes p(n) steps on all inputs of length
n. Thus, M has exactly $|Q| = O(p(n))$ configurations.

The input string w is placed in the first n locations of the otherwise blank common
memory. To determine the length of the input, for each i the ith CREW PRAM processor
examines the words in locations i and i + 1. If location i + 1 is blank but location i is not,
i = n. The nth processor then computes p(n) in $O(\log^2 n)$ serial steps (see Problem 8.2)
and places it in common memory.

The circuit C(M, n) is constructed from representations of next-state mappings, one
mapping for every state transition. Since there are no external inputs to M (all inputs are
recorded on the input tape before the computation begins), all next-state mappings are the
same. As shown in Section 3.2, let this one mapping be defined by a Boolean $|Q| \times |Q|$
matrix $M_\Delta$ whose rows and columns are indexed by configurations of M. A configuration
of M is a tuple $(q, h_1, h_2, h_3, x)$ in which q is the current state, $h_1$, $h_2$, and $h_3$ are the
positions of the heads on the input, output, and work tapes, respectively, and x is the
current contents of the work tape. Since M computes a log-space transformation, it executes a
polynomial number of steps. Thus, each configuration has length $O(\log n)$. Consequently,
a single CREW PRAM processor can determine in $O(\log n)$ time whether an entry in row r and
column c, where r and c are associated with configurations, has value 0 or 1. For
concreteness, assign PRAM processor i to row r and column c of $M_\Delta$, where $r = \lfloor i/p(n) \rfloor$ and
$c = i - r \times p(n)$, quantities that can be computed in $O(\log^2 n)$ steps.

The circuit C(M, n) simulating M is obtained via a prefix computation on p(n) copies
of the matrix $M_\Delta$ using matrix multiplication as the associative operator. (See Section 3.2.)

Once C(M, n) has been written into the common memory, it can be evaluated by
assigning one processor per gate and then computing its value as many times as the depth of
C(M, n). This involves a four-phase operation in which the jth processor reads each of the
at most two arguments of the jth gate in the first two phases, computes its value in the third,
and then writes it to common memory in the fourth. This process is repeated as many times
as the depth of the circuit C(M, n), thereby ensuring that correct values for gates propagate
throughout the circuit. Again concurrent reads and exclusive writes suffice. ■
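
The round-by-round evaluation can be simulated serially as in the Python sketch below. The dictionary-based circuit format and the name evaluate_in_rounds are assumptions of this example, and the loop over gates stands in for processors that would act simultaneously. All reads in a round see the values written in the previous round, and each simulated processor writes only the cell of its own gate, which mirrors the concurrent-read, exclusive-write discipline.

```python
def evaluate_in_rounds(gates, inputs, depth):
    """Evaluate a circuit by recomputing every gate `depth` times.

    gates:  dict mapping a gate name to (operation, argument names)
    inputs: dict mapping an input name to 0 or 1
    """
    value = {g: 0 for g in gates}       # arbitrary initial gate values
    value.update(inputs)
    for _ in range(depth):
        new = {}
        for g, (op, args) in gates.items():
            a = [value[x] for x in args]            # phases 1 and 2: read
            if op == "AND":                         # phase 3: compute
                new[g] = a[0] & a[1]
            elif op == "OR":
                new[g] = a[0] | a[1]
            elif op == "NOT":
                new[g] = 1 - a[0]
            else:
                raise ValueError("unknown gate type: %s" % op)
        value.update(new)                           # phase 4: write
    return value


# g3 = NOT(x1 AND x2) OR x3, a circuit of depth 3.
circuit = {"g1": ("AND", ("x1", "x2")),
           "g2": ("NOT", ("g1",)),
           "g3": ("OR", ("g2", "x3"))}
result = evaluate_in_rounds(circuit, {"x1": 1, "x2": 1, "x3": 0}, 3)
print(result["g3"])
```

A gate at depth k holds its correct value from round k onward, so after as many rounds as the depth of the circuit every gate is correct. Stopping early can leave stale values: in the example, g3 is wrong after two rounds and correct after three.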

These two results (and Problem 8.37) imply the result stated below, namely, that the
binary functions computed by circuits with polynomial size and poly-logarithmic depth are the
same as those computed by the CREW PRAM with polynomially many processors and
poly-logarithmic time.

THEOREM 8.14.1 The functions $f : \mathcal{B}^* \mapsto \mathcal{B}^*$ computed by circuits of polynomial size and
poly-logarithmic depth are the same as those computed by the CREW PRAM with a polynomial number
of processors and poly-logarithmic time.

8.14.2  The  Parallel  Computation  Thesis 

A  deep  connection  exists  between  serial  space  and  parallel  time.  The  parallel  computation 
thesis  states  that  sequential  space  and  parallel  time  are  polynomially  related;  that  is,  if  there 
exists  a  sequential  algorithm  that  uses  space  S,  then  there  exists  a  parallel  algorithm  using  time 
p(S)  for  some  polynomial  p  and  vice  versa.  There  is  strong  evidence  that  this  hypothesis  holds. 

In this section we set the stage for discussing the parallel computation thesis in a limited
way by showing that every log-space reduction (on a Turing machine) can be realized by a
CREW PRAM in time $O(\log^2 n)$ with polynomially many processors. This implies that if a
P-complete problem can be solved on a PRAM with polynomially many processors in
poly-logarithmic time, then so can every problem in P, an unlikely prospect.

LEMMA 8.14.3 Log-space transformations can be realized by CREW PRAMs with polynomially
many processors in time $O(\log^2 n)$.

Proof We use the CREW PRAM described in the proof of Lemma 8.14.2. The processors
in this PRAM are then assigned to perform the matrix operations in the order required
for a parallel prefix computation. (See Section 2.6.) If we assign $|Q(n)|^2$ processors per
matrix multiplication operation, each operation can be done in $O(\log |Q(n)|^2) = O(\log n)$
steps. Since the prefix computation has depth $O(\log n)$, the PRAM can perform the prefix
computation in time $O(\log^2 n)$. The number of processors used is $p(n) \cdot O(|Q(n)|^2)$, which
is a polynomial in n. Concurrent reads and exclusive writes suffice for these operations. ■
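
The logarithmic depth of a prefix computation can be seen in the following Python sketch. It uses recursive doubling, a simple scheme chosen here for brevity; the prefix circuit of Section 2.6 may be organized differently. The name parallel_prefix is hypothetical, and the list comprehension stands in for processors that would update all positions of a round at once. The operator must be associative but need not be commutative, as is the case for matrix multiplication.

```python
def parallel_prefix(items, op):
    """Return ([x1, x1 op x2, ..., x1 op ... op xm], number of rounds).

    In each round, position i combines the partial result ending `step`
    positions earlier with its own, so the span covered doubles per round.
    """
    prefix = list(items)
    m = len(prefix)
    step, rounds = 1, 0
    while step < m:
        # Every position reads values from the previous round only.
        prefix = [prefix[i] if i < step else op(prefix[i - step], prefix[i])
                  for i in range(m)]
        step *= 2
        rounds += 1
    return prefix, rounds


# String concatenation is associative but not commutative.
print(parallel_prefix(["a", "b", "c", "d", "e"], lambda x, y: x + y))
```

For m items the loop runs $\lceil \log_2 m \rceil$ rounds. When the items are copies of a Boolean matrix and the operator is Boolean matrix multiplication, each round costs one parallel matrix product, which gives the product of two logarithmic factors in the time bound of the lemma.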

Since a log-space transformation can be realized in poly-logarithmic time with
polynomially many processors on a CREW PRAM, if a CREW PRAM solves a P-complete problem in
poly-logarithmic time, we can compose such machines to form a CREW PRAM with
poly-logarithmic time and polynomially many processors to solve an arbitrary problem in P.

THEOREM 8.14.2 If a P-complete problem can be solved in poly-logarithmic time with
polynomially many processors on a CREW PRAM, then so can all problems in P, and all
problems in P are fully parallelizable.

8.15  Circuit  Complexity  Classes 

In this section we introduce several important circuit complexity classes including NC, the
languages recognized by uniform families of circuits whose size and depth are polynomial and
poly-logarithmic in n, respectively, and P/poly, the largest set of languages $L \subseteq \mathcal{B}^*$ with the
property that L is recognized by a (non-uniform) circuit family of polynomial size. We also
derive relationships among these classes and previously defined classes.

8.15.1  Efficiently  Parallelizable  Languages 

DEFINITION 8.15.1 The class $\mathrm{NC}^k$ contains those languages L recognized by a uniform family of
Boolean circuits of polynomial size and depth $O(\log^k n)$ in n, the length of an input. The class
NC is the union of the classes $\mathrm{NC}^k$, $k \ge 1$; that is,

$$\mathrm{NC} = \bigcup_{k \ge 1} \mathrm{NC}^k$$

In Section 8.14 we explored the connection between circuit size and depth and PRAM
time and number of processors and concluded that circuits having polynomial size and
poly-logarithmic depth compute the same languages as do PRAMs with a polynomial number of
processors and poly-logarithmic parallel time.

The class NC is considered to be the largest feasibly parallelizable class of languages. By
feasible we mean that the number of gates (equivalently processors) is no more than polynomial
in the length n of the input and by parallelizable we mean that circuit depth (equivalently
computation time) must be no more than poly-logarithmic in n. Feasibly parallelizable
languages meet both requirements.

The prefix circuits introduced in Section 2.6 belong to $\mathrm{NC}^1$, as do circuits constructed
with prefix operations, such as binary addition and subtraction (see Section 2.7) and the
circuits for solutions of linear recurrences (see Problem 2.24). (Strictly speaking, these functions
are not predicates and do not define languages. However, comparisons between their values
and a threshold convert them to predicates. In this section we liberally mix functions and
predicates.) The class $\mathrm{NC}^1$ also contains functions associated with integer multiplication and
division.
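
As a concrete illustration of how binary addition reduces to a prefix computation, the following Python sketch computes all carries with a prefix scan over (generate, propagate) pairs. It is a sketch under assumptions, not the construction of Section 2.7: the names `carry_op`, `prefix_scan`, and `add_bits` are hypothetical, and the scan is simulated sequentially, one parallel round per loop pass.

```python
def carry_op(low, high):
    """Combine (generate, propagate) pairs; `low` covers the lower-order bits."""
    g1, p1 = low
    g2, p2 = high
    # The combined block generates a carry if the high block does, or if the
    # high block propagates a carry generated by the low block.
    return (g2 or (p2 and g1), p1 and p2)


def prefix_scan(items, op):
    """Inclusive prefix scan by recursive doubling (about log2(len) rounds)."""
    out = list(items)
    step = 1
    while step < len(out):
        prev = out
        out = [prev[i] if i < step else op(prev[i - step], prev[i])
               for i in range(len(prev))]
        step *= 2
    return out


def add_bits(x, y, width):
    """Add two non-negative integers that each fit in `width` bits."""
    xs = [(x >> i) & 1 for i in range(width)]
    ys = [(y >> i) & 1 for i in range(width)]
    gp = [(bool(a & b), bool(a ^ b)) for a, b in zip(xs, ys)]
    scanned = prefix_scan(gp, carry_op)
    # carries[i] is the carry into bit position i; carries[width] is the carry out.
    carries = [False] + [g for g, _ in scanned]
    total = 0
    for i in range(width):
        total |= (xs[i] ^ ys[i] ^ int(carries[i])) << i
    total |= int(carries[width]) << width
    return total
```

Because `carry_op` is associative and has constant size, the scan has logarithmic depth in the number of bits, which is the property that places addition in $\mathrm{NC}^1$.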

The fast Fourier transform (see Section 6.7.3) and merging networks (see Section 6.8) can
both be realized by algebraic and combinatorial circuits of depth $O(\log n)$, where n is the
number of circuit inputs. If the additions and multiplications of the FFT are done over a ring
of integers modulo m for some m, the FFT can be realized by a circuit of depth $O(\log^2 n)$. If
the items to be merged are represented in binary, a comparison operator can be realized with
depth $O(\log n)$ and merging can also be done with a circuit of depth $O(\log^2 n)$. Thus, both
problems are in $\mathrm{NC}^2$.

When matrices are defined over a field of characteristic zero, the inverse of invertible
matrices (see Section 6.5.5) can be computed by an algebraic circuit of depth $O(\log^2 n)$. If the
matrix entries when represented as binary numbers have size n, the ring operations may be
realized in terms of binary addition and multiplication, and matrix inversion is in $\mathrm{NC}^3$.

Also, it follows from Theorem 8.13.3 that the nth circuit in the log-space uniform families
of circuits has polynomial size and depth $O(\log^2 n)$; that is, it is contained in $\mathrm{NC}^2$. Also
contained in this set is the transitive closure of a Boolean matrix (see Section 6.4). Since the
circuits constructed in Chapter 3 to simulate finite-state machines as well as polynomial-time
Turing machines are log-space uniform (see Theorem 8.13.1), each of these circuit families is
in $\mathrm{NC}^2$.

We  now  relate  these  complexity  classes  to  one  another  and  to  P. 

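THEOREM 8.15.1 For $k \ge 2$, $\mathrm{NC}^1 \subseteq \mathrm{L} \subseteq \mathrm{NL} \subseteq \mathrm{NC}^2 \subseteq \mathrm{NC}^k \subseteq \mathrm{NC} \subseteq \mathrm{P}$.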


Proof The containment $\mathrm{L} \subseteq \mathrm{NL}$ is obvious. The containment $\mathrm{NL} \subseteq \mathrm{NC}^2$ is a restriction
of the result of Theorem 8.13.3 to $r(n) = O(\log n)$. The containments $\mathrm{NC}^2 \subseteq \mathrm{NC}^k \subseteq \mathrm{NC}$
follow from the definitions. The last containment, $\mathrm{NC} \subseteq \mathrm{P}$, is a consequence of the
fact that the circuit on n inputs in a log-space uniform family of circuits, call it $C_n$, can
be generated in polynomial time by a Turing machine that can then evaluate $C_n$ in a time
quadratic in its length, that is, in polynomial time. (Theorems 8.5.8 and 8.13.2 apply.)

The first containment, namely $\mathrm{NC}^1 \subseteq \mathrm{L}$, is slightly more difficult to establish. Given a
language $L \in \mathrm{NC}^1$, consider the problem of recognizing whether or not a string w is in L.
This recognition task is done in log-space by invoking two log-space transformations, as is
now explained.

The first log-space transformation generates the nth circuit, $C_n$, in the family recognizing
L. $C_n$ has value 1 if w is in L and 0 otherwise. By definition, $C_n$ has size polynomial
in n. Also, each circuit is described by a straight-line program, as explained in Section 2.2.

The second log-space transformation evaluates the circuit with temporary work space
proportional to the maximal length of such strings. If the strings identifying gates have
larger length, their transformation would use more space. (Note that it is easy to identify
gates with an $O(\log^2 n)$-length string by concatenating the number of each gate on the
path to it, including itself.) For this reason we give an efficient encoding of gate locations.

The gates of circuits in $\mathrm{NC}^1$ generally have fan-out exceeding 1. That is, they have more
than one parent gate in the circuit. We describe how to identify gates with strings; the scheme
may associate multiple strings with a gate. We walk the graph, which is the circuit, starting from
the output vertex and moving toward input vertices. The output gate is identified with the
empty string $\epsilon$. If we reach a gate g via a parent whose string is p, g is identified by
p0 or p1. If the parent has only one descendant, as would be the case for NOT gates and
inputs, we represent g by p0. If it has two descendants, as would be the case for AND and
OR, and g has the smaller gate number, its string is p0; otherwise it is p1.
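
The labeling rule can be sketched in Python as follows. The circuit representation (a dictionary mapping a gate number to an operation name and a tuple of predecessor gate numbers) and the names `gate_strings` and `EXAMPLE` are assumptions made for illustration. The sketch keeps whole paths on a stack, so it shows which strings a gate receives, not the logarithmic space bound.

```python
# Hypothetical example circuit: gates 1 and 2 are inputs and gate 5 is the output.
# Gate 1 feeds both gate 3 and gate 4, so it has fan-out 2.
EXAMPLE = {
    1: ("IN", ()),
    2: ("IN", ()),
    3: ("NOT", (1,)),
    4: ("AND", (1, 2)),
    5: ("OR", (3, 4)),
}


def gate_strings(circuit, output):
    """Map each gate number to the list of binary path strings that reach it."""
    labels = {}
    stack = [(output, "")]  # the output gate gets the empty string
    while stack:
        gate, path = stack.pop()
        labels.setdefault(gate, []).append(path)
        _, preds = circuit[gate]
        # A single predecessor gets suffix 0; with two predecessors the one
        # with the smaller gate number gets suffix 0 and the other suffix 1.
        for bit, child in enumerate(sorted(preds)):
            stack.append((child, path + str(bit)))
    return labels
```

In the example, gate 1 is reached both through the NOT gate and through the AND gate, so it receives the two strings 00 and 10.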

The algorithm to produce each of these binary strings can be executed in logarithmic
space because one need only walk each path in the circuit from the output to inputs. The
tuple defining each gate contains the gate numbers of its predecessors, $O(\log n)$-length
numbers, and the algorithm need only carry one such number at a time in its working
memory to find the location of a predecessor gate in the input string containing the description
of the circuit.

The second log-space transformation evaluates the circuit using the binary strings
describing the circuit. It visits the input vertex with the lexicographically smallest string and
determines its value. It then evaluates the gate whose string is that of the input vertex minus
the last bit. Even though it may have to revisit all gates on the path to this vertex to do this,
$O(\log n)$ space is used. If this gate is either a) AND and the input has value 0, b) OR and
the input has value 1, or c) NOT, the value of the gate is decided. If the gate has more than
one input and its value is not decided, the other input to it is evaluated (the one with suffix
1). Because the second input to the gate is evaluated only if needed, its value determines
the value of the gate. This process is repeated at each gate in the circuit until the output
gate is reached and its value computed. Since this procedure keeps only one path of length
$O(\log n)$ active at a time, the algorithm uses space $O(\log n)$. ■
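
The evaluation procedure can be sketched in Python. This is an illustration under assumptions, not the author's program: the circuit representation and the names `gate_at`, `evaluate`, and `EXAMPLE` are hypothetical. The only state carried between steps is the current path string and one Boolean value; the gate named by a path is found by re-walking the circuit from the output each time, mirroring the revisiting described in the proof.

```python
# Hypothetical example circuit computing (NOT x1) OR (x1 AND x2); gate 5 is the output.
EXAMPLE = {
    1: ("IN", ()),
    2: ("IN", ()),
    3: ("NOT", (1,)),
    4: ("AND", (1, 2)),
    5: ("OR", (3, 4)),
}


def gate_at(circuit, output, path):
    """Re-walk the circuit from the output to find the gate named by `path`."""
    gate = output
    for bit in path:
        gate = sorted(circuit[gate][1])[int(bit)]
    return gate


def evaluate(circuit, output, inputs):
    """Evaluate the circuit while storing only one path and one value."""
    path, value = "", None  # value is None while descending toward an input
    while True:
        gate = gate_at(circuit, output, path)
        if value is None:
            if circuit[gate][0] != "IN":
                path += "0"  # descend to the first predecessor
                continue
            value = inputs[gate]
        if path == "":
            return value  # the output gate has been evaluated
        bit, path = path[-1], path[:-1]  # ascend to the parent gate
        parent_op = circuit[gate_at(circuit, output, path)][0]
        if parent_op == "NOT":
            value = not value
        else:
            decided = ((parent_op == "AND" and not value)
                       or (parent_op == "OR" and value))
            if bit == "0" and not decided:
                path += "1"  # first input did not decide the gate
                value = None
            # Otherwise the parent's value equals `value`: either the first
            # input decided it, or the second input determines it.
```

The path string never grows beyond the circuit depth, so for an $\mathrm{NC}^1$ circuit the stored state has length $O(\log n)$.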

An  important  open  question  is  whether  the  complexity  hierarchy  of  this  theorem  collapses 
and,  if  so,  where.  For  example,  is  it  true  that  a  problem  in  P  is  also  in  NC?  If  so,  all  serial 
polynomial-time  problems  are  parallelizable  with  a  number  of  processing  elements  polynomial 
in  the  length  of  the  input  and  poly-logarithmic  time,  an  unlikely  prospect. 

8.15.2  Circuits  of  Polynomial  Size 

We  now  examine  the  class  of  languages  P/poly  and  show  that  they  are  exactly  the  languages 
recognized  by  Boolean  circuits  of  polynomial  size.  To  set  the  stage  we  introduce  advice  and 
pairing  functions. 

DEFINITION 8.15.2 An advice function $a : \mathbb{N} \mapsto \mathcal{B}^*$ maps natural numbers to binary strings.
A polynomial advice function is an advice function for which $|a(n)| \le p(n)$ for p(n) a
polynomial function in n.

DEFINITION 8.15.3 A pairing function $\langle \cdot, \cdot \rangle : \mathcal{B}^* \times \mathcal{B}^* \mapsto \mathcal{B}^*$ encodes pairs of binary strings
x and y with two end markers and a separator (a comma) into the binary string $\langle x, y \rangle$.

Pairing functions can be very easy to describe and compute. For example, $\langle x, y \rangle$ can
be implemented by representing 0 by 01, 1 by 10, both $\langle$ and $\rangle$ by 11, and , (comma) by 00.
Thus, $\langle 0010, 110 \rangle$ is encoded as 11010110010010100111. It is clearly trivial to identify,
extract, and decode each component of the pair. We are now prepared to define P/poly.
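
The encoding in this example can be written out directly. The Python sketch below uses the two-bit codes given in the text; the function names `pair` and `unpair` and the error handling are assumptions added for illustration.

```python
# Two-bit codes from the text: 0 -> 01, 1 -> 10, both end markers -> 11, comma -> 00.
CODE = {"0": "01", "1": "10", "<": "11", ">": "11", ",": "00"}
DECODE = {"01": "0", "10": "1"}


def pair(x, y):
    """Encode the pair <x, y> of binary strings as a single binary string."""
    return "".join(CODE[c] for c in "<" + x + "," + y + ">")


def unpair(z):
    """Recover (x, y) from an encoding produced by `pair`."""
    symbols = [z[i:i + 2] for i in range(0, len(z), 2)]
    if len(z) % 2 or len(symbols) < 3 or symbols[0] != "11" or symbols[-1] != "11":
        raise ValueError("not a valid pair encoding")
    inner = symbols[1:-1]
    sep = inner.index("00")  # position of the comma
    x = "".join(DECODE[s] for s in inner[:sep])
    y = "".join(DECODE[s] for s in inner[sep + 1:])
    return x, y
```

Here `pair("0010", "110")` yields the string 11010110010010100111 from the text, and `unpair` inverts it. Both run in time linear in the length of their input.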

DEFINITION 8.15.4 Let $a : \mathbb{N} \mapsto \mathcal{B}^*$ be a polynomial advice function. P/poly is the set of
languages $L = \{w \mid \langle w, a(|w|) \rangle \in A\}$ for which there is a language A in P.

The advice $a(|w|)$ given on a string w in a language $L \in$ P/poly is the same for all
strings of the same length. Furthermore, $\langle w, a(|w|) \rangle$ must be easy to recognize, namely,
recognizable in polynomial time.

The subset of the languages in P/poly for which the advice function returns the empty string is
exactly the languages in P, that is, $\mathrm{P} \subseteq$ P/poly.

The following result is the principal result of this section. It gives two different
interpretations of the advice given on strings.

THEOREM 8.15.2 A language L is recognizable by a family of circuits of polynomial size if and
only if $L \in$ P/poly.

Proof Let L be recognizable by a family $\mathcal{C}$ of circuits of polynomial size. We show that it is
in P/poly.

Let $\overline{C}_n$ be an encoding of the circuit $C_n$ in $\mathcal{C}$ that recognizes strings in $L \cap \mathcal{B}^n$. Let the
advice function $a(n) = \overline{C}_n$ and let $w \in \mathcal{B}^*$ have length n. Then, $w \in L$ if and only if
the value of $C_n$ on w is 1. Since $\overline{C}_n$ has length polynomial in n, $w \in L$ if and only if the
pair $\langle w, a(|w|) \rangle$ is a “yes” instance of CIRCUIT VALUE, which has been shown to
be in P. (See Theorem 8.13.2.)

On the other hand, suppose that $L \in$ P/poly. We show that L is recognizable by circuits
of polynomial size. By definition there is an advice function $a : \mathbb{N} \mapsto \mathcal{B}^*$ and a language
$A \in \mathrm{P}$ for L such that for all $w \in L$, $\langle w, a(|w|) \rangle \in A$. Since $A \in \mathrm{P}$, there is a
polynomial-time DTM that accepts $\langle w, a(|w|) \rangle$. By Theorem 8.13.1 there is a circuit
of polynomial size that recognizes $\langle w, a(|w|) \rangle$. The string $a(|w|)$ is constant for strings
w of length n. Thus, the circuit for $A \cap \mathcal{B}^n$ to which is supplied the constant string $a(|w|)$
is a circuit of length polynomial in n that accepts strings w in L. ■
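
The first direction of the proof, in which the advice is a circuit description and the polynomial-time language A is circuit evaluation, can be sketched in Python. Everything named here is an assumption for illustration: the toy language (non-empty strings whose first and last bits agree), the straight-line-program format, and the functions `advice`, `in_A`, and `in_L`.

```python
def advice(n):
    """Hypothetical advice a(n): a straight-line program for inputs of length n >= 1.

    Each step is ("IN", input_index) or (op, earlier_step, ...); the last step
    is the output. The program computes XNOR of the first and last input bits.
    """
    return [
        ("IN", 0), ("IN", n - 1),   # steps 0, 1: first and last bits
        ("AND", 0, 1),              # step 2: both bits are 1
        ("NOT", 0), ("NOT", 1),     # steps 3, 4
        ("AND", 3, 4),              # step 5: both bits are 0
        ("OR", 2, 5),               # step 6: output
    ]


def in_A(w, program):
    """Decide membership of the pair (w, program) by evaluating the program on w."""
    vals = []
    for step in program:
        op = step[0]
        if op == "IN":
            vals.append(w[step[1]] == "1")
        elif op == "NOT":
            vals.append(not vals[step[1]])
        elif op == "AND":
            vals.append(vals[step[1]] and vals[step[2]])
        elif op == "OR":
            vals.append(vals[step[1]] or vals[step[2]])
        else:
            raise ValueError("unknown operation: " + op)
    return vals[-1]


def in_L(w):
    """w is in L exactly when the pair (w, a(|w|)) is in A."""
    return in_A(w, advice(len(w)))
```

The evaluation in `in_A` takes time linear in the length of the program, so it is polynomial whenever the advice is. In this toy case the advice happens to be computable; the definition of P/poly does not require that.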


Problems 

MATHEMATICAL  PRELIMINARIES 

8.1  Show  that  if  strings  over  an  alphabet  A  with  at  least  two  letters  are  encoded  over  a 
one-letter  alphabet  (a  unary  encoding),  then  strings  of  length  n  over  A  require  strings 
of  length  exponential  in  n  in  the  unary  encoding. 

8.2 Show that the polynomial function $p(n) = K_1 n^k$ can be computed in $O(\log^2 n)$ serial
steps from n, for constants $K_1 \ge 1$ and $k \ge 1$, on a RAM when additions require
one unit of time.

SERIAL COMPUTATIONAL MODELS

8.3  Given  an  instance  of  satisfiability,  namely,  a  set  of  clauses  over  a  set  of  literals  and  values 
for  the  variables,  show  that  the  clauses  can  be  evaluated  in  time  quadratic  in  the  length 
of  the  instance. 

8.4 Consider the RAM of Section 8.4.1. Let $l(I)$ be the length, measured in bits, of the
contents I of the RAM's input registers. Similarly, let $l(v)$ be the maximal length of any
integer addressed by an instruction in the RAM's program. Show that after k steps the
length of the contents of any RAM memory location is at most $k + l(I) + l(v)$.

Give an example of a computation that produces a word of length k.

Hint:  Consider  which  instructions  have  the  effect  of  increasing  the  length  of  an  integer 
used  or  produced  by  the  RAM  program. 

8.5 Consider the RAM of Section 8.4.1. Assume the RAM executes T steps. Describe a
Turing-machine simulation of this RAM that uses space proportional to $T^2$ measured
in bits.

Hint: Represent each RAM memory location visited during a computation by an
(address, contents) pair. When a RAM location is updated, fill the cells on the
second tape containing the old (address, contents) pair with a special “blank”
character and add the new (address, contents) pair to the end of the list of such pairs.
Use the results of Problem 8.4 to bound the length of individual words.

8.6 Consider the RAM of Section 8.4.1. Using the result of Problem 8.5, describe a
multi-tape Turing machine that simulates in $O(T^3)$ steps a T-step computation by the RAM.

Hint: Let your machine have seven tapes: one to hold the input, a second to hold
the contents of RAM memory recorded as (address, contents) pairs separated and
terminated by appropriate markers, a third to hold the current value of the program
counter, a fourth to hold the memory address being sought, and three tapes for operands
and results. On the input tape place the program to be executed and the input on which
it is to be executed. Handle the second tape as suggested in Problem 8.5. When
performing an operation that has two operands, place them on the fifth and sixth tapes
and the result on the seventh tape.

8.7  Justify  using  the  number  of  tape  cells  as  a  measure  of  space  for  the  Turing  machine 
when  the  more  concrete  measure  of  bits  is  used  for  the  space  measure  for  the  RAM. 

CLASSIFICATION  OF  DECISION  PROBLEMS 

8.8 Given a Turing machine, deterministic or not, show that there exists another Turing
machine with a larger tape alphabet that performs the same computation but in a
number of steps and number of tape cells that are smaller by constant factors.

8.9  Show  that  strings  in  TRAVELING  SALESPERSON  can  be  accepted  by  a  deterministic 
Turing  machine  in  an  exponential  number  of  steps. 

COMPLEMENTS  OF  COMPLEXITY  CLASSES 

8.10  Show  that  VALIDITY  is  log-space  complete  for  coNP. 

8.11 Prove that the complements of NP-complete problems are coNP-complete.

8.12 Show that the complexity class P is contained in the intersection of NP and coNP.

8.13 Demonstrate that $\mathrm{coNP} \subseteq \mathrm{PSPACE}$.

8.14  Prove  that  if  a  coNP-complete  problem  is  in  NP,  then  NP  =  coNP. 

REDUCTIONS 

8.15 If $\mathcal{P}_1$ and $\mathcal{P}_2$ are decision problems, a Turing reduction from $\mathcal{P}_1$ to $\mathcal{P}_2$ is any OTM
that solves $\mathcal{P}_1$ given an oracle for $\mathcal{P}_2$. Show that the reductions of Section 2.4 are
Turing reductions.

8.16 Prove that the reduction given in Section 10.9.1 of a pebble game to a branching
computation is a Turing reduction. (See Problem 8.15.)

8.17 Show that if a problem $\mathcal{P}_1$ can be Turing-reduced to problem $\mathcal{P}_2$ by a polynomial-time
OTM and $\mathcal{P}_2$ is in P, then $\mathcal{P}_1$ is also in P.

Hint:  Since  each  invocation  of  the  oracle  can  be  done  deterministically  in  polynomial 
time  in  the  length  of  the  string  written  on  the  oracle  tape,  show  that  it  can  be  done  in 
time  polynomial  in  the  length  of  the  input  to  the  OTM. 

8.18 a) Show that every fixed power of an integer written as a binary k-tuple can be
computed by a DTM in space $O(k)$.

b) Show that every fixed polynomial in an integer written as a binary k-tuple can be
computed by a DTM in space $O(k)$.

Hint: Show that carry-save addition can be used to multiply two k-bit integers with
work space $O(k)$.

HARD  AND  COMPLETE  PROBLEMS 

8.19 Polynomial-time Turing reductions are Turing reductions in which the
OTM runs in time polynomial in the length of its input. Show that the class of Turing
reductions is transitive.

P-COMPLETE  PROBLEMS 

8.20  Show  that  numbers  can  be  assigned  to  gates  in  an  instance  of  MONOTONE  CIRCUIT 
VALUE  that  corresponds  to  an  instance  of  CIRCUIT  VALUE  in  Theorem  8.9.1  so  that 
the  reduction  from  it  to  MONOTONE  CIRCUIT  VALUE  can  be  done  in  logarithmic 
space. 

8.21  Prove  that  LINEAR  PROGRAMMING  described  below  is  P-complete. 

LINEAR  PROGRAMMING 

Instance:  Integer-valued  m  x  n  matrix  A  and  column  m-vectors  b  and  c. 

Answer: “Yes” if there is a rational column n-vector $x \ge 0$ such that $Ax \le b$ and x
maximizes $c^T x$.

NP-COMPLETE  PROBLEMS 

8.22 A Horn clause has at most one positive literal (an instance of $x_i$). Every other literal
in a Horn clause is a negative literal (an instance of $\overline{x}_i$). HORN SATISFIABILITY is an
instance of SATISFIABILITY in which each clause is a Horn clause. Show that HORN
SATISFIABILITY is in P.

Hint: If all literals in a clause are negative, the clause is satisfied only if some associated
variables have value 0. If a clause has one positive literal, say y, and negative literals, say
$\overline{x}_1, \overline{x}_2, \ldots, \overline{x}_k$, then the clause is satisfied if and only if the implication
$x_1 \wedge x_2 \wedge \cdots \wedge x_k \Rightarrow y$ is true. Thus, y has value 1 when each of these variables has value 1. Let T
be a set of variables that must have value 1. Let T contain initially all positive literals that
appear alone in a clause. Cycle through all implications and for each implication all
of whose left-hand side variables appear in T but whose right-hand side variable does
not, add this variable to T. Since T grows until all left-hand sides are satisfied, this
procedure terminates. Show that all satisfying assignments contain T.
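
The procedure in this hint can be sketched in Python. The clause representation (a pair consisting of the set of negated variables and the single positive variable, or `None` when there is none) and the name `horn_sat` are assumptions made for illustration; the final check of the all-negative clauses is the natural completion of the hint rather than something it states.

```python
def horn_sat(clauses):
    """Return the forced set T of true variables, or None if unsatisfiable.

    Each clause is (negatives, positive): `negatives` is a set of variable
    names that appear negated, and `positive` is the one unnegated variable
    or None. A clause with a positive literal is the implication
    (AND of negatives) => positive.
    """
    true_vars = set()
    changed = True
    while changed:
        changed = False
        for negatives, positive in clauses:
            # A positive literal alone in a clause has an empty left-hand
            # side, so it enters T on the first pass.
            if (positive is not None and positive not in true_vars
                    and negatives <= true_vars):
                true_vars.add(positive)
                changed = True
    # An all-negative clause fails only if every one of its variables was forced to 1.
    for negatives, positive in clauses:
        if positive is None and negatives <= true_vars:
            return None
    return true_vars
```

Each pass over the clauses either adds a variable to T or ends the loop, so the number of passes is at most the number of variables plus one, which gives a polynomial running time.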

8.23  Describe  a  polynomial-time  algorithm  to  determine  whether  an  instance  of  CIRCUIT 
SAT  is  a  “yes”  instance  when  the  circuit  in  question  consists  of  a  layer  of  AND  gates 
followed  by  a  layer  of  OR  gates.  Inputs  are  connected  to  AND  gates  and  the  output  gate 
is  an  OR  gate. 

8.24  Prove  that  the  CLIQUE  problem  defined  below  is  NP-complete. 

CLIQUE 

Instance: The description of an undirected graph $G = (V, E)$ and an integer k.
Answer:  “Yes”  if  there  is  a  set  of  k  vertices  of  G  such  that  all  vertices  are  adjacent. 

8.25  Prove  that  the  HALF  CLIQUE  problem  defined  below  is  NP-complete. 

HALF  CLIQUE 

Instance: The description of an undirected graph $G = (V, E)$ in which $|V|$ is even and
an integer k.

Answer: “Yes” if G contains a clique on $|V|/2$ vertices or has more than k edges.

Hint: Try reducing an instance of CLIQUE on a graph with m vertices and a clique of
size k to this problem by expanding the number of vertices and edges to create a graph
that has $|V| > m$ vertices and a clique of size $|V|/2$. Show that a test for the condition
that G contains more than k edges can be done very efficiently by counting the number
of bits among the variables describing edges.

8.26  Show  that  the  NODE  COVER  problem  defined  below  is  NP-complete. 

NODE  COVER 

Instance: The description of an undirected graph $G = (V, E)$ and an integer k.

Answer:  “Yes”  if  there  is  a  set  of  k  vertices  such  that  every  edge  contains  at  least  one  of 
these  vertices. 

8.27 Prove that the HAMILTONIAN PATH decision problem defined below is NP-complete.

HAMILTONIAN PATH

Instance:  The  description  of  an  undirected  graph  G. 

Answer:  “Yes”  if  there  is  a  path  visiting  each  node  once. 

Hint: 3-SAT can be reduced to HAMILTONIAN PATH, but the construction is
challenging. First, add literals to clauses in an instance of 3-SAT so that each clause has
three literals. Second, construct and interconnect three types of subgraphs (gadgets).
Figures 8.23(a) and (b) show the first and second of these gadgets, $G_1$ and $G_2$.

Figure 8.23 Gadgets used to reduce 3-SAT to HAMILTONIAN PATH (panels (a), (b), and (c)).

There is one first gadget for each variable $x_i$, $1 \le i \le n$, denoted $G_{1,i}$. The left path
between the two middle vertices in $G_{1,i}$ is associated with the value $x_i = 1$ and the
right path is associated with the complementary value, $x_i = 0$. Vertex f of $G_{1,i}$ is
identified with vertex e of $G_{1,i+1}$ for $1 \le i \le n - 1$, vertex e of $G_{1,1}$ is connected only
to a vertex in $G_{1,1}$, and vertex f of $G_{1,n}$ is connected to the clique described below.
There is one second gadget for each literal in each clause. Thus, if $x_i$ ($\overline{x}_i$) is a literal in
clause $c_j$, then we create a gadget $G_{2,j,i,1}$ ($G_{2,j,i,0}$).

Since a Hamiltonian path touches every vertex, a path through $G_{2,j,i,v}$ for $v \in \{0, 1\}$
passes either from a to c or from b to d.

For each $1 \le i \le n$ the two parallel edges of $G_{1,i}$ are broken open and two vertices
appear in each of them. For each instance of the literal $x_i$ ($\overline{x}_i$), connect the vertices a
and c of $G_{2,j,i,1}$ ($G_{2,j,i,0}$) to the pair of vertices on the left (right) that are created in
$G_{1,i}$. Connect the b vertex of one literal in clause $c_j$ to the d vertex of another one, as
suggested in Fig. 8.23(c).

The third gadget has vertices g and h and a connecting edge. One of these two vertices,
h, is connected in a clique with the b and d vertices of the gadgets $G_{2,j,i,v}$ and the f
vertex of $G_{1,n}$.

This graph has a Hamiltonian path between g and the e vertex of $G_{1,1}$ if and only if
the instance of 3-SAT is a “yes” instance.

8.28 Show that the TRAVELING SALESPERSON decision problem defined below is
NP-complete.

TRAVELING SALESPERSON

Instance: An integer k and a set of $n(n-1)/2$ distances $\{d_{1,2}, d_{1,3}, \ldots, d_{1,n}, d_{2,3}, \ldots,
d_{2,n}, \ldots, d_{n-1,n}\}$ between n cities.

Answer: “Yes” if there is a tour (an ordering) $\{i_1, i_2, \ldots, i_n\}$ of the cities such that the
length $l = d_{i_1,i_2} + d_{i_2,i_3} + \cdots + d_{i_{n-1},i_n} + d_{i_n,i_1}$ of the tour satisfies $l \le k$.

Hint: Try reducing HAMILTONIAN PATH to TRAVELING SALESPERSON.

8.29 Give a proof that the PARTITION problem defined below is NP-complete.

PARTITION

Instance: A set $Q = \{a_1, a_2, \ldots, a_n\}$ of positive integers.

Answer: “Yes” if there is a subset of Q that adds to $\frac{1}{2} \sum_{1 \le i \le n} a_i$.

PSPACE-COMPLETE  PROBLEMS 

8.30  Show  that  the  procedure  tree_eval  described  in  the  proof  of  Theorem  8.12.1  can 
be  modified  slightly  to  apply  to  the  evaluation  of  the  trees  generated  in  the  proof  of 
Theorem  8.12.3. 

Hint: A vertex of in-degree k can be replaced by a binary tree of k leaves and depth
$\log_2 k$.


THE  CIRCUIT  MODEL  OF  COMPUTATION 

8.31 Prove that the class of circuits described in Section 3.1 that simulate a finite-state
machine is uniform.

8.32 Generalize Theorems 8.13.1 and 8.13.2 to uniform circuit families and Turing
machines that use more than logarithmic space and polynomial time, respectively.

8.33 Write an $O(\log^2 n)$-space program based on the one in Fig. 8.20 to describe a circuit for
the transitive closure of an $n \times n$ matrix based on matrix squaring.

THE  PARALLEL  RANDOM-ACCESS  MACHINE  MODEL 

8.34 Complete the proof of Lemma 8.14.2 by making specific assignments of data to
memory locations. Also, provide formulas for the assignment of processors to tasks.

8.35 Give a definition of a log-space uniform family of PRAMs for which Lemma 8.14.1
can be extended to show that the function $f : \mathcal{B}^* \mapsto \mathcal{B}^*$ computed by a log-space
family of PRAMs can also be computed by a log-space uniform family of circuits satisfying
the conditions of Lemma 8.14.1.

8.36 Exhibit a non-uniform family of PRAMs that can solve problems that are not
recursively enumerable.

8.37 Lemma 8.14.1 is stated for PRAMs in which the CPU does not access a common
memory address larger than $O(p(n)t(n))$. In particular, this model does not permit indirect
addressing. Show that this theorem can be extended to RAM CPUs that do allow
indirect addressing by using the representation for memory accesses in Problem 8.6.

Chapter  Notes 

The classification of languages by the resources needed for their recognition is a very large
subject capable of book-length study. The reader interested in going beyond the introduction
given here is advised to consult one of the readily available references. The Handbook of
Theoretical Computer Science contains three survey articles on this subject by van Emde Boas
[350], Johnson [151], and Karp and Ramachandran [161].
The first examines simulation of one computational model by another for a large range of
models. The second provides a large catalog of complexity classes and relationships between
them. The third examines parallel algorithms and complexity. Other sources for more
information on this topic are the books by Hopcroft and Ullman [141], Lewis and Papadimitriou
[200], Balcázar, Díaz, and Gabarró on structural complexity [27], Garey and Johnson [109]
on the theory of NP-completeness, Greenlaw, Hoover, and Ruzzo [120] on P-completeness,
and Papadimitriou [235] on computational complexity.

The  Turing  machine  was  defined  by  Alan  Turing  in  1936  [338],  as  was  the  oracle  Turing 
machine.  Random-access  machines  were  introduced  by  Shepherdson  and  Sturgis  [308]  and 
the  performance  of  RAMs  was  analyzed  by  Cook  and  Reckhow  [77]. 

Hartmanis,  Lewis,  and  Stearns  [127,128]  gave  the  study  of  time  and  space  complexity 
classes  its  impetus.  Their  papers  contain  many  of  the  basic  theorems  on  complexity  classes, 
including  the  space  and  time  hierarchy  theorems  stated  in  Section  8.5.1.  The  gap  theorem 
was  obtained  by  Trakhtenbrot  [334]  and  rediscovered  by  Borodin  [51].  Blum  [46]  developed 
machine-independent  complexity  measures  and  established  a  speedup  theorem  showing  that 
for  some  languages  there  is  no  single  fastest  recognition  algorithm  [47]. 

Many individuals identified and recognized the importance of the classes P and NP. Cook
[74] formalized NP, emphasized the importance of polynomial-time reducibility, and
exhibited the first NP-complete problem, SATISFIABILITY. Karp [159] then demonstrated that
a number of other combinatorial problems, including TRAVELING SALESPERSON, are
NP-complete. Cook used Turing reductions in his classification whereas Karp used
polynomial-time transformations. Independently and almost simultaneously Levin [199] (see also [335])
was led to concepts similar to the above.

The relationship between nondeterministic and deterministic space (Theorem 8.5.5 and
Corollary 8.5.1) was established by Savitch [297]. The proof that nondeterministic space
classes are closed under complementation (Theorem 8.6.2 and Corollary 8.6.2) is
independently due to Szelepcsényi [322] and Immerman [145].

Theorem 8.6.4, showing that PRIMALITY is in $\mathrm{NP} \cap \mathrm{coNP}$, is due to Pratt [257].

Cook [75] defined the concept of a P-complete problem and exhibited the first such
problem. He was followed quickly by Jones and Laaser [153] and Galil [108]. Ladner [185] showed
that circuits simulating Turing machines (see [286]) could be constructed in logarithmic space,
thereby establishing that CIRCUIT VALUE is P-complete. Goldschlager [117] demonstrated
that MONOTONE CIRCUIT VALUE is P-complete. Valiant [345] and Cook established that
LINEAR INEQUALITIES is P-hard, and Khachian [165] showed that this problem is in P. The
proof that DTM ACCEPTANCE is P-complete is due to Johnson [151].

Cook  [74]  gave  the  first  proof  that  SATISFIABILITY  is  NP-complete  and  also  gave  the 
reduction  to  3-SAT.  Independently,  Levin  [199]  (see  also  [335])  was  led  to  similar  concepts 
for  combinatorial  problems.  Schafer  [299]  showed  that  NAESAT  is  NP-complete.  Karp  [159] 
established  that  0-1  INTEGER  PROGRAMMING,  3-COLORING,  EXACT  COVER,  SUBSET 
SUM,  TASK  SEQUENCING,  and  INDEPENDENT  SET  are  NP-complete. 

The proof that 2-SAT is in NL (Theorem 8.11.1) is found in Papadimitriou [235].

Karp [159] exhibited a PSPACE-complete problem, Meyer and Stockmeyer [316] demonstrated
that QUANTIFIED SATISFIABILITY is PSPACE-complete, and Schafer established that
GENERALIZED GEOGRAPHY is PSPACE-complete [299].

The  notion  of  a  uniform  circuit  was  introduced  by  Borodin  [52]  and  has  been  examined  by 
many  others.  (See  [120].)  Borodin  [52]  established  the  connection  between  nondeterministic 
space and circuit depth stated in Theorem 8.13.3. Stockmeyer and Vishkin [317] show how
to  simulate  efficiently  the  PRAM  with  circuits  and  vice  versa.  (See  also  [161].)  The  class  NC 
was  defined  by  Cook  [76].  Theorem  8.15.2  is  due  to  Pippenger  [249].  The  class  P/poly  and 
Theorem  8.15.2  are  due  to  Karp  and  Lipton  [160]. 

A large variety of parallel computational models have been developed. (See van Emde
Boas [350] and Greenlaw, Hoover, and Ruzzo [120].) The PRAM was introduced by Fortune
and  Wyllie  [103]  and  Goldschlager  [118,119]. 

Several  problems  on  the  efficient  simulation  of  RAMs  are  from  Papadimitriou  [235]. 


The circuit complexity of a binary function is measured by the size or depth of the smallest
or  shallowest  circuit  for  it.  Circuit  complexity  derives  its  importance  from  the  corollary  to 
Theorem  3.9.2;  namely,  if  a  function  has  a  large  circuit  size  over  a  complete  basis  of  fixed 
fan-in,  then  the  time  on  a  Turing  machine  required  to  compute  it  is  large.  The  importance  of 
this observation is illustrated by the following fact. For $n \geq 1$, let $f_L^{(n)}$ be the characteristic
function of an NP-complete language $L$, where $f_L^{(n)}$ has value 1 on strings of length $n$ in $L$
and value 0 otherwise. If $f_L^{(n)}$ has super-polynomial circuit size for all sufficiently large $n$, then
$\mathrm{P} \neq \mathrm{NP}$.

In  this  chapter  we  introduce  methods  for  deriving  lower  bounds  on  circuit  size  and  depth. 
Unfortunately,  it  is  generally  much  more  difficult  to  derive  good  lower  bounds  on  circuit 
complexity  than  good  upper  bounds;  an  upper  bound  measures  the  size  or  depth  of  a  particular 
circuit  whereas  a  lower  bound  must  rule  out  a  smaller  size  or  depth  for  all  circuits.  As  a 
consequence,  the  lower  bounds  derived  for  functions  realized  by  circuits  over  complete  bases 
of  bounded  fan-in  are  often  weak. 

In  attempting  to  understand  lower  bounds  for  complete  bases,  researchers  have  studied 
monotone  circuits  over  the  monotone  basis  and  bounded-depth  circuits  over  the  basis  {AND, 
OR,  NOT}  in  which  the  first  two  gates  are  allowed  to  have  unbounded  fan-in.  Formula  size, 
which  is  approximately  the  size  of  the  smallest  circuit  with  fan-out  1,  has  also  been  studied. 
Lower  bounds  to  formula  size  also  produce  lower  bounds  to  circuit  depth,  a  measure  of  the 
parallel  time  needed  for  a  function. 

Research  on  these  restricted  circuit  models  has  led  to  some  impressive  results.  Exponential 
lower  bounds  on  circuit  size  have  been  derived  for  monotone  functions  over  the  monotone 
basis  and  functions  such  as  parity  when  realized  by  bounded-depth  circuits.  Unfortunately, 
the  methods  used  to  obtain  these  results  may  not  apply  to  complete  bases  of  bounded  fan-in. 
Fortunately, it has been shown that the slice functions have about the same circuit size over
both the monotone and standard (non-monotone) bases. This may help resolve the
P $\stackrel{?}{=}$ NP question, since there are NP-complete slice problems.

Despite  the  difficulty  of  deriving  lower  bounds,  circuit  complexity  continues  to  offer  one 
of  the  methods  of  highest  potential  for  distinguishing  between  P  and  NP. 
9.1 Circuit Models and Measures

In this section we characterize types of logic circuits by their bases and the fan-in and fan-out
of basis elements. We consider bases that are complete and incomplete and that have
bounded  and  unbounded  fan-in.  We  also  consider  circuits  in  which  the  fan-out  is  restricted 
and  unrestricted.  Each  of  these  factors  can  affect  the  size  and  depth  of  a  circuit. 

9.1.1  Circuit  Models 

The (general) logic circuit is the graph of a straight-line program in which the variables have
value 0 or 1 and the operations are Boolean functions $g : \mathcal{B}^p \mapsto \mathcal{B}$, $p \geq 1$. (Boolean functions
have one binary value. Logic circuits are defined in Section 1.2 and discussed at length in
Chapter 2.) The vertices in a logic circuit are labeled with Boolean operations and are called
gates; the set of different gate types used in a circuit is called the basis (denoted $\Omega$) for the
circuit. The fan-in of a basis is the maximal fan-in of any function in the basis. A circuit
computes the binary function $f : \mathcal{B}^n \mapsto \mathcal{B}^m$, which is the mapping from the $n$ circuit inputs
to the $m$ gate outputs designated as circuit outputs.

The standard basis, denoted $\Omega_0$, is the set {AND, OR, NOT} in which AND and OR have
fan-in 2. The full two-input basis, denoted $B_2$, consists of all two-input Boolean functions.
The dyadic unate basis, denoted $U_2$, consists of all Boolean functions of the form $(x^a \wedge y^b)^c$
for constants $a$, $b$, $c$ in $\mathcal{B}$. Here $x^1 = x$ and $x^0 = \bar{x}$.

A basis $\Omega$ is complete if every binary function can be computed by a circuit over $\Omega$. The
bases $\Omega_0$, $B_2$, and $U_2$ are complete, as is the basis consisting of the NAND gate computing the
function $x \;\mathrm{NAND}\; y = \overline{x \wedge y}$. (See Problem 2.5.)

The  bounded  fan-out  circuit  model  specifies  a  bound  on  the  fan-out  of  a  circuit.  As  we 
shall  see,  the  fan-out-1  circuit  plays  a  special  role  related  to  circuit  depth.  Each  circuit  of 
fan-out  1  corresponds  to  a  formula  in  which  the  operators  are  the  functions  associated  with 
vertices  of  the  circuit.  Figure  9.1  shows  an  example  of  a  circuit  of  fan-out  1  over  the  standard 
basis  and  its  associated  formula.  (See  also  Problem  9.9.)  Although  each  input  variable  appears 
once  in  this  example,  Boolean  functions  generally  require  multiple  instances  of  variables  (have 
fan-out  greater  than  1).  Formula  size  is  studied  at  length  in  Section  9.4. 

To define the monotone circuits, we need an ordering of binary $n$-tuples. Two such tuples,
$x = (x_1, x_2, \ldots, x_n)$ and $y = (y_1, y_2, \ldots, y_n)$, are in the relation $x \leq y$ if for all $1 \leq i \leq n$,
$x_i \leq y_i$, where $0 \leq 0$, $1 \leq 1$, and $0 \leq 1$, but $1 \not\leq 0$. (Thus, $001011 \leq 101111$, but
$011011 \not\leq 101111$.)

A monotone circuit is a circuit over the monotone basis $\Omega_{\mathrm{mon}}$ = {AND, OR} in which
the fan-in is 2. There is a direct correspondence between monotone circuits and monotone
functions. A monotone function is a function $f : \mathcal{B}^n \mapsto \mathcal{B}^m$ that is either monotone
increasing, that is, for all $x, y \in \mathcal{B}^n$, if $x \leq y$, then $f(x) \leq f(y)$, or is monotone
decreasing, that is, for all $x, y \in \mathcal{B}^n$, if $x \leq y$, then $f(x) \geq f(y)$. Unless stated explicitly, a
monotone function will be understood to be a monotone increasing function.
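The ordering on tuples and the definition of a monotone increasing function can be checked by brute force for small $n$. The following is a minimal sketch for single-output functions; the helper names (`leq`, `is_monotone_increasing`, `bits`) and the two sample functions are illustrative choices, not notation from the text.

```python
from itertools import product

def leq(x, y):
    """Componentwise order on binary tuples: x <= y iff x_i <= y_i for every i."""
    return all(a <= b for a, b in zip(x, y))

def is_monotone_increasing(f, n):
    """Brute-force test that x <= y implies f(x) <= f(y) for all x, y in B^n."""
    points = list(product((0, 1), repeat=n))
    return all(f(x) <= f(y) for x in points for y in points if leq(x, y))

def bits(s):
    """Convert a string such as '001011' into a tuple of 0/1 integers."""
    return tuple(int(ch) for ch in s)

def majority3(x):
    """Monotone example: 1 when at least two of the three inputs are 1."""
    return int(sum(x) >= 2)

def parity3(x):
    """Non-monotone example: raising an input from 0 to 1 can lower the output."""
    return sum(x) % 2

print(leq(bits("001011"), bits("101111")))   # first pair of tuples from the text
print(leq(bits("011011"), bits("101111")))   # second pair of tuples from the text
print(is_monotone_increasing(majority3, 3), is_monotone_increasing(parity3, 3))
```

The check enumerates all pairs of points, so it is only practical for a handful of variables.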

A  monotone  Boolean  function  has  the  following  expansion  on  the  first  variable,  as  the 
reader  can  show.  (See  Problem  9.10.)  A  similar  expansion  is  possible  on  any  variable. 

$$f(x_1, x_2, \ldots, x_n) = f(0, x_2, \ldots, x_n) \vee (x_1 \wedge f(1, x_2, \ldots, x_n))$$

By applying this expansion to every variable in succession, we see that each monotone function
can be realized by a circuit over the monotone basis. Furthermore, the monotone basis $\Omega_{\mathrm{mon}}$
is complete for the monotone functions, that is, every monotone function can be computed
by a circuit over the basis $\Omega_{\mathrm{mon}}$. (See Problem 2.)

$$y = ((((x_7 \vee x_6) \wedge (x_5 \vee x_4)) \vee x_3) \wedge (x_2 \wedge x_1))$$

Figure 9.1 A circuit of fan-out 1 over a basis with fan-in 2 and a corresponding formula. The
value $y$ at the root is the AND of the value $(((x_7 \vee x_6) \wedge (x_5 \vee x_4)) \vee x_3)$ of the left subtree with
the value $(x_2 \wedge x_1)$ of the right subtree.
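The expansion of a monotone increasing function on its first variable, $f(x_1, \ldots, x_n) = f(0, x_2, \ldots, x_n) \vee (x_1 \wedge f(1, x_2, \ldots, x_n))$, can be verified exhaustively on small examples. The sketch below assumes single-output functions given as Python callables on 0/1 tuples; `expansion_holds` and the sample functions are hypothetical names used only for illustration.

```python
from itertools import product

def expansion_holds(f, n):
    """Check f(x) == f(0, rest) OR (x1 AND f(1, rest)) at every point of B^n."""
    for x in product((0, 1), repeat=n):
        rest = x[1:]
        rhs = f((0,) + rest) | (x[0] & f((1,) + rest))
        if f(x) != rhs:
            return False
    return True

def threshold2(x):
    """Monotone increasing: 1 when at least two inputs are 1."""
    return int(sum(x) >= 2)

def parity(x):
    """Not monotone, so the expansion is not expected to hold."""
    return sum(x) % 2

print(expansion_holds(threshold2, 4))
print(expansion_holds(parity, 3))
```

For parity the identity fails, for instance at $x = (1, 1, 0)$, which illustrates that monotonicity is what makes the expansion valid.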

In  Section  9.6  we  show  that  some  monotone  functions  on  n  variables  require  monotone 
circuits  whose  size  is  exponential  in  n.  In  particular,  some  monotone  functions  requiring 
exponential-size  monotone  circuits  can  be  realized  by  polynomial-size  circuits  over  the  standard 
basis $\Omega_0$. Thus, the absence of negation can result in a large increase in circuit size.

The bounded-depth circuit is a circuit over the standard basis $\Omega_0$ where the fan-in of AND
and OR gates is allowed to be unbounded, but the circuit depth is bounded. The conjunctive
and disjunctive normal forms and the product-of-sums and sum-of-products normal forms
realize arbitrary Boolean functions by circuits of depth 2 over $\Omega_0$. (See Section 2.3.) In these
normal forms negations are used only on the input variables. Note that any circuit over the
standard basis can be converted to a circuit in which the NOT gates are applied only to the
input variables. (See Problem 9.11.)

9.1.2  Complexity  Measures 

We  now  define  the  measures  of  complexity  studied  in  this  chapter.  The  depth  of  a  circuit  is 
the  number  of  gates  of  fan-in  2  or  more  on  the  longest  path  in  the  circuit.  (Note  that  NOT 
gates  do  not  affect  the  depth  measure.) 

DEFINITION 9.1.1 The circuit size of a binary function $f : \mathcal{B}^n \mapsto \mathcal{B}^m$ with respect to the basis
$\Omega$, denoted $C_\Omega(f)$, is the smallest number of gates in any circuit for $f$ over the basis $\Omega$. The circuit
size with fan-out $s$, denoted $C_{s,\Omega}(f)$, is the circuit size of $f$ when the circuit fan-out is limited
to at most $s$.

The circuit depth of a binary function $f : \mathcal{B}^n \mapsto \mathcal{B}^m$ with respect to the basis $\Omega$, $D_\Omega(f)$, is
the depth of the smallest-depth circuit for $f$ over the basis $\Omega$. The circuit depth with fan-out $s$,
denoted $D_{s,\Omega}(f)$, is the circuit depth of $f$ when the circuit fan-out is limited to at most $s$.

The formula size of a Boolean function $f : \mathcal{B}^n \mapsto \mathcal{B}$ with respect to a basis $\Omega$, $L_\Omega(f)$, is the
minimal number of input vertices in any circuit of fan-out 1 for $f$ over the basis $\Omega$.

It  is  important  to  note  the  distinction  between  formula  and  circuit  size:  in  the  former 
the  number  of  input  vertices  is  counted,  whereas  in  the  latter  it  is  the  number  of  gates.  A 
relationship  between  the  two  is  shown  in  Lemma  9.2.2. 
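To make the distinction concrete, the measures can be computed for one particular fan-out-1 circuit, namely the formula of Figure 9.1. The sketch below represents a formula as nested tuples `(operator, child, ...)` with strings as input vertices; this representation and the function names are assumptions made for illustration. It measures a single circuit, not the minimum over all circuits that the definitions call for.

```python
def leaves(t):
    """Number of input vertices (what formula size counts)."""
    if isinstance(t, str):
        return 1
    return sum(leaves(child) for child in t[1:])

def gates(t):
    """Number of gates (what circuit size counts)."""
    if isinstance(t, str):
        return 0
    return 1 + sum(gates(child) for child in t[1:])

def depth(t):
    """Gates of fan-in 2 or more on the longest path; NOT gates are not counted."""
    if isinstance(t, str):
        return 0
    below = max(depth(child) for child in t[1:])
    return below + (1 if len(t) > 2 else 0)

# The formula of Figure 9.1: ((((x7 OR x6) AND (x5 OR x4)) OR x3) AND (x2 AND x1))
fig_9_1 = ("AND",
           ("OR", ("AND", ("OR", "x7", "x6"), ("OR", "x5", "x4")), "x3"),
           ("AND", "x2", "x1"))

print(leaves(fig_9_1), gates(fig_9_1), depth(fig_9_1))
```

For this tree the leaf count exceeds the gate count by one, as expected for a tree in which every gate has fan-in exactly 2.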

9.2  Relationships  Among  Complexity  Measures 

In  this  section  we  explore  the  effect  on  circuit  complexity  measures  of  a  change  in  either  the 
basis  or  the  fan-out  of  a  circuit.  We  also  establish  relationships  between  circuit  depth  and 
formula  size. 

9.2.1  Effect  of  Fan-Out  on  Circuit  Size 

It is interesting to ask how the circuit size and depth of a function change as the maximal fan-out
of a circuit is reduced. This issue is important in understanding these complexity measures
and  in  the  use  of  technologies  that  limit  the  fan-out  of  gates.  The  following  simple  facts  about 
trees  are  useful  in  comparing  complexity  measures.  (See  Problem  9.2.) 

LEMMA 9.2.1 A rooted tree of maximal fan-in $r$ containing $k$ vertices has at most $k(r - 1) + 1$
leaves and a rooted tree with $l$ leaves and fan-in $r$ has at most $l - 1$ vertices with fan-in 2 or more
and at most $2(l - 1)$ edges.

From the above result we establish the following connection between circuit size with fan-out 1
and formula size.

LEMMA 9.2.2 Let $\Omega$ be a basis of fan-in $r$. For each $f : \mathcal{B}^n \mapsto \mathcal{B}$ the following inequalities hold
between formula size, $L_\Omega(f)$, and fan-out-1 circuit size, $C_{1,\Omega}(f)$:

$$(L_\Omega(f) - 1)/(r - 1) \leq C_{1,\Omega}(f) \leq 3L_\Omega(f) - 2$$

Proof The first inequality follows from the definition of formula size and the first result
stated in Lemma 9.2.1 in which $k = C_{1,\Omega}(f)$. The second inequality also follows from
Lemma 9.2.1. A tree with $L_\Omega(f)$ leaves has at most $L_\Omega(f) - 1$ vertices with fan-in of 2 or
more and at most $2(L_\Omega(f) - 1)$ edges between vertices (including the leaves). Each of these
edges can carry a NOT gate, as can the output gate, for a total of at most $2L_\Omega(f) - 1$ NOT
gates. Thus, a circuit of fan-out 1 has at most $3L_\Omega(f) - 2$ gates. ■

As we now show, circuit size increases by at most a constant factor when the fan-out of the
circuit is reduced to $s$ for $s \geq 2$. Before developing this result we need a simple fact about a
complete basis $\Omega$, namely, that at most two gates are needed to compute the identity function
$i(x) = x$, as shown in the next paragraph. If a basis contains AND or OR gates, the identity
function can be obtained by attaching both of their inputs to the same source.

We are done if $\Omega$ contains a function such that by fixing all but one variable, $i(x)$ is
computed. If not, then we look for a non-monotone function in $\Omega$. Since some binary
functions are non-monotone ($\bar{x}$, for example), some function $g$ in a complete basis $\Omega$ is
non-monotone. This means there exist tuples $x$ and $y$ for $g$, $x \leq y$, such that $g(x) = 1 > g(y) = 0$.
Let $u$ and $v$ be the largest and smallest tuples, respectively, satisfying $x \leq u < v \leq y$
and $g(u) = 1$ and $g(v) = 0$. Then $u$ and $v$ differ in at most one position. Without loss
of generality, let that position be the first and let the values in the remaining positions in
both tuples be $(c_2, \ldots, c_n)$. It follows that $g(1, c_2, \ldots, c_n) = 0$ and $g(0, c_2, \ldots, c_n) = 1$ or
$g(x, c_2, \ldots, c_n) = \bar{x}$. Feeding the output of one such gate into a second copy therefore
computes $i(x)$. If $l(\Omega)$ is the number of gates from $\Omega$ needed to realize the identity
function, then $l(\Omega) = 1$ or 2.

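THEOREM 9.2.1 Let $\Omega$ be a complete basis of fan-in $r$ and let $f : \mathcal{B}^n \mapsto \mathcal{B}^m$. The following
inequalities hold on $C_{s,\Omega}(f)$:

$$C_\Omega(f) \leq C_{s+1,\Omega}(f) \leq C_{s,\Omega}(f) \leq C_{1,\Omega}(f)$$

Furthermore, $C_{s,\Omega}(f)$ has the following relationship to $C_\Omega(f)$ for $s \geq 2$:

$$C_{s,\Omega}(f) \leq C_\Omega(f)\left(1 + \frac{l(\Omega)(r - 1)}{s - 1}\right)$$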

Proof The first set of inequalities holds because a smallest circuit with fan-out $s$ is no smaller
than a smallest circuit with fan-out $s + 1$, a less restrictive type of circuit.

The last inequality follows by constructing a tree of identity functions at each gate whose
fan-out exceeds $s$. (See Fig. 9.2.) If a gate has fan-out $\phi > s$, reduce the fan-out to $s$ and
then attach an identity gate to one of these $s$ outputs. This increases the fan-out from $s$ to
$s + s - 1$. If $\phi$ is larger than this number, repeat the process of adding an identity gate $k$
times, where $k$ is the smallest integer such that $s + k(s - 1) \geq \phi$ or is the largest integer
such that $s + (k - 1)(s - 1) < \phi$. Thus, $k < (\phi - 1)/(s - 1)$.

Let $\phi_i$ denote the fan-out of the $i$th gate in a circuit for $f$ of potentially unbounded
fan-out and let $k_i$ be the largest integer satisfying the following bound:

$$k_i \leq \frac{\phi_i - 1}{s - 1}$$

Then at most $\sum_{i=1}^{C_\Omega(f)} (k_i\, l(\Omega) + 1)$ gates are needed in the circuit of fan-out $s$ to realize $f$,
one for the $i$th gate in the original circuit and $k_i\, l(\Omega)$ gates for the $k_i$ copies of the identity
function at the $i$th gate. Note that $\sum_{i=1}^{C_\Omega(f)} \phi_i$ is the number of edges directed away from gates
in the original circuit. But since each edge directed away from a gate is an edge directed into
a gate, this number is at most $rC_\Omega(f)$ since each gate has fan-in at most $r$.

It follows that the smallest number of gates in a circuit with fan-out $s$ for $f$ satisfies the
following bound:

$$C_{s,\Omega}(f) \leq \sum_{i=1}^{C_\Omega(f)} \left(l(\Omega)\,\frac{\phi_i - 1}{s - 1} + 1\right) \leq C_\Omega(f)\left(1 + \frac{l(\Omega)(r - 1)}{s - 1}\right)$$

which demonstrates that circuit size with a fan-out $s \geq 2$ differs from the unbounded fan-out
circuit size by at most a constant factor. ■

Figure 9.2 Conversion of a vertex with fan-out more than $s$ to a subtree with fan-out $s$,
illustrated for $s = 2$.
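The counting in this proof can be sketched numerically. The code below computes the number $k$ of identity functions added at a gate of fan-out $\phi$ under a fan-out limit $s$, and the resulting gate total for a list of fan-outs. The function names, the sample fan-out list, and the parameter `identity_cost` (standing for the number of gates per identity function, 1 or 2) are illustrative assumptions.

```python
def identity_gates_needed(phi, s):
    """Smallest k >= 0 with s + k*(s - 1) >= phi, for a fan-out limit s >= 2."""
    if s < 2:
        raise ValueError("the construction needs s >= 2")
    k = 0
    while s + k * (s - 1) < phi:
        k += 1
    return k

def bounded_fanout_gate_count(fanouts, s, identity_cost):
    """One gate per original gate plus identity_cost gates for each
    identity function attached at that gate."""
    return sum(identity_gates_needed(phi, s) * identity_cost + 1
               for phi in fanouts)

# Hypothetical circuit with four gates of fan-in at most r = 2.
fanouts = [4, 2, 1, 0]
r, s, identity_cost = 2, 2, 1
total = bounded_fanout_gate_count(fanouts, s, identity_cost)
bound = len(fanouts) * (1 + identity_cost * (r - 1) / (s - 1))
print(total, bound)
```

In this example only the gate of fan-out 4 needs extra identity gates, and the total stays below the bound of the theorem.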

With the construction employed in Theorem 9.2.1, an upper bound can be stated on
$D_{s,\Omega}(f)$ that is proportional to the product of $D_\Omega(f)$ and $\log C_\Omega(f)$. (See Problem 9.12.)
The upper bound stated above on $C_{s,\Omega}(f)$ can be achieved by a circuit that also achieves an
upper bound on $D_{s,\Omega}(f)$ that is proportional to $D_\Omega(f)$ and $\log_r s$ [138].

9.2.2  Effect  of  Basis  Change  on  Circuit  Size  and  Depth 

We  now  consider  the  effect  of  a  change  in  basis  on  circuit  size  and  depth.  In  the  next  section 
we  examine  the  relationship  between  formula  size  and  depth,  from  which  we  deduce  the  effect 
of  a  basis  change  on  formula  size. 

LEMMA 9.2.3 Given two complete bases, $\Omega_a$ and $\Omega_b$, and a function $f : \mathcal{B}^n \mapsto \mathcal{B}^m$, the circuit
size and depth of $f$ in these two bases differ by at most constant multiplicative factors.

Proof Because each basis is complete, every function in $\Omega_a$ can be computed by a fixed
number of gates in $\Omega_b$, and vice versa. Given a circuit with basis $\Omega_a$, a circuit with basis
$\Omega_b$ can be constructed by replacing each gate from $\Omega_a$ by a fixed number of gates from
$\Omega_b$. This has the effect of increasing the circuit size by at most a constant factor. It follows
that $C_{\Omega_a}(f) = \Theta(C_{\Omega_b}(f))$. Since this construction also increases the depth by at most a
constant factor, it follows that $D_{\Omega_a}(f) = \Theta(D_{\Omega_b}(f))$. ■

9.2.3  Formula  Size  Versus  Circuit  Depth 

A  logarithmic  relationship  exists  between  the  formula  size  and  circuit  depth  of  a  function,  as 
we  now  show.  If  a  formula  is  represented  by  a  balanced  tree,  this  result  follows  from  the  fact 
that  the  circuit  fan-in  is  bounded.  However,  since  we  cannot  guarantee  that  each  formula 
corresponds  to  a  balanced  tree,  we  must  find  a  way  to  balance  an  unbalanced  tree. 

To balance a formula and provide a bound on the circuit depth of a function in terms of
formula size, we make use of the multiplexer function $f_{\mathrm{mux}} : \mathcal{B}^3 \mapsto \mathcal{B}$ on three inputs
$f_{\mathrm{mux}}(a, y_1, y_0)$. Here the value of $a$ determines which of the two other values is returned.

$$f_{\mathrm{mux}}(a, y_1, y_0) = \begin{cases} y_0 & a = 0 \\ y_1 & a = 1 \end{cases}$$

This function can be realized by

$$f_{\mathrm{mux}}(a, y_1, y_0) = (\bar{a} \wedge y_0) \vee (a \wedge y_1)$$
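As a quick check, the realization of the multiplexer as $(\bar{a} \wedge y_0) \vee (a \wedge y_1)$ can be compared with its case definition (return $y_0$ when $a = 0$ and $y_1$ when $a = 1$) on all eight inputs. The sketch below uses 0/1 integers for Boolean values; the function name `f_mux` is chosen to mirror the text.

```python
from itertools import product

def f_mux(a, y1, y0):
    """Multiplexer realized over AND, OR, NOT: (NOT a AND y0) OR (a AND y1)."""
    return ((1 - a) & y0) | (a & y1)

# Print the full truth table: a selects between y0 (a = 0) and y1 (a = 1).
for a, y1, y0 in product((0, 1), repeat=3):
    print(a, y1, y0, f_mux(a, y1, y0))
```

The realization uses one level of AND gates followed by one OR gate, with a NOT applied only to the input $a$.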
The measure $d(\Omega)$ of a basis $\Omega$ defined below is used to obtain bounds on the circuit depth of
a function in terms of its formula size.

DEFINITION 9.2.1 Given a basis $\Omega$ of fan-in $r$, the constant $d(\Omega)$ is defined as follows:

$$d(\Omega) = \left(D_\Omega(f_{\mathrm{mux}}) + 1\right) / \log_r\left(\frac{r + 1}{r}\right)$$

Over the standard basis $\Omega_0$, $d(\Omega_0) = 3.419$.

We  now  derive  a  separator  theorem  for  trees.  This  is  a  theorem  stating  that  a  tree  can 
be  decomposed  into  two  trees  of  about  the  same  size  by  removing  one  edge.  We  begin  by 
establishing  a  property  about  trees  that  implies  the  separator  theorem. 

LEMMA 9.2.4 Let $T$ be a tree with $n$ internal (non-leaf) vertices. If the fan-in of every vertex of
$T$ is at most $r$, then for any $k$, $1 \leq k \leq n$, $T$ has a vertex $v$ such that the subtree $T_v$ rooted at $v$
has at least $k$ leaves but each of its children $T_{v_1}, T_{v_2}, \ldots, T_{v_p}$, $p \leq r$, has fewer than $k$ leaves.

Proof  If  the  property  holds  at  the  root,  the  result  follows.  If  not,  move  to  some  subtree  of 
T  that  has  at  least  k  leaves  and  apply  the  test  recursively.  Because  a  leaf  vertex  has  one  leaf 
vertex  in  its  subtree,  this  process  terminates  on  some  vertex  v  at  which  the  property  holds. 
If  it  terminates  on  a  leaf  vertex,  each  of  its  children  is  an  empty  tree.  ■ 

COROLLARY 9.2.1 Let $T$ be a tree of fan-in $r$ with $n$ leaves. Then $T$ has a subtree $T_v$ rooted at
a vertex $v$ such that $T_v$ has at least $\lceil n/(r + 1) \rceil$ leaves but at most $\lfloor rn/(r + 1) \rfloor$.

Proof Let $v$ be the vertex of Lemma 9.2.4 and let $k = \lceil n/(r + 1) \rceil$. Since $T_v$ has at most
$r$ subtrees each containing no more than $\lceil n/(r + 1) \rceil - 1 \leq n/(r + 1)$ leaves, the result
follows. ■
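The search described in the proof of Lemma 9.2.4 can be sketched directly: start at the root and keep descending into a child with at least $k$ leaves until no such child exists. In the sketch below a tree is a nested tuple of children and any non-tuple value is a leaf; this representation, the function names, and the unbalanced test tree are assumptions for illustration only.

```python
def count_leaves(t):
    """A leaf is any non-tuple value; an internal vertex is a tuple of children."""
    return sum(count_leaves(c) for c in t) if isinstance(t, tuple) else 1

def find_separator(tree, r):
    """Return a subtree T_v with at least ceil(n/(r+1)) leaves, each of whose
    children has fewer than that many leaves."""
    n = count_leaves(tree)
    k = -(-n // (r + 1))              # ceil(n / (r + 1)) in integer arithmetic
    v = tree
    while isinstance(v, tuple):
        heavy = [c for c in v if count_leaves(c) >= k]
        if not heavy:
            break                     # every child has fewer than k leaves
        v = heavy[0]
    return v

def caterpillar(n):
    """Highly unbalanced binary tree with n leaves labelled 0..n-1."""
    t = 0
    for leaf in range(1, n):
        t = (t, leaf)
    return t

tree = caterpillar(9)
print(count_leaves(find_separator(tree, 2)))   # leaves in the separating subtree
```

Recomputing leaf counts at every step keeps the sketch short at the cost of quadratic time; caching the counts would make the search linear.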

We  now  apply  this  decomposition  of  trees  to  develop  bounds  on  formula  size. 

THEOREM 9.2.2 Let $\Omega$ be a complete basis of fan-in $r$. Any function $f : \mathcal{B}^n \mapsto \mathcal{B}$ with formula
size $L_\Omega(f) \geq 2$ has circuit depth $D_\Omega(f)$ satisfying the following bounds:

$$\log_r L_\Omega(f) \leq D_\Omega(f) \leq d(\Omega) \log_r L_\Omega(f)$$

Proof The lower bound follows because a rooted tree of fan-in $r$ with depth $d$ has at most
$r^d$ leaves. Since $L_\Omega(f)$ leaves are needed to compute $f$ with a tree circuit over $\Omega$, the result
follows directly.

The derivation of the upper bound is by induction on formula size. We first establish
the basis for induction: that $D_\Omega(f) \leq d(\Omega) \log_r L_\Omega(f)$ for $L_\Omega(f) = 2$. To show this,
observe that any function $f$ with $L_\Omega(f) = 2$ depends on at most two variables. There are 16
functions on two variables (which includes the functions on one variable), of which 10 have
the property that both variables affect the output. Each of these 10 functions can be realized
from a circuit for $f_{\mathrm{mux}}$ by adding at most one NOT gate on one input and one NOT on
the output. (See Problem 9.13.) But, as seen from the discussion preceding Theorem 9.2.1,
every complete basis contains a non-monotone function all but one of whose inputs can be
fixed so that the function computes the NOT of its one remaining input. Thus, a circuit
with depth $D_\Omega(f_{\mathrm{mux}}) + 2$ suffices to realize a function with $L_\Omega(f) = 2$.

The basis for induction is that $D_\Omega(f_{\mathrm{mux}}) + 2 \leq d(\Omega) \log_r L_\Omega(f)$ for $L_\Omega(f) = 2$,
which we now show.

$$\begin{aligned}
d(\Omega) \log_r L_\Omega(f) &= \left(D_\Omega(f_{\mathrm{mux}}) + 1\right)(\log_r 2)/\log_r\left(\frac{r + 1}{r}\right) \\
&= \left(D_\Omega(f_{\mathrm{mux}}) + 1\right)/\log_2\left(\frac{r + 1}{r}\right) \\
&\geq 1.7\left(D_\Omega(f_{\mathrm{mux}}) + 1\right) \geq D_\Omega(f_{\mathrm{mux}}) + 2
\end{aligned}$$

since $(r + 1)/r \leq 1.5$ and $D_\Omega(f_{\mathrm{mux}}) \geq 1$.

The inductive hypothesis is that any function $f$ with a formula size $L_\Omega(f) \leq L_0 - 1$
can be realized by a circuit with depth $d(\Omega) \log_r L_\Omega(f)$.

Let $T$ be the tree associated with a formula for $f$ of size $L_0$. The value computed by
$T$ can be computed from the function $f_{\mathrm{mux}}$ using the values produced by three trees, as
suggested in Fig. 9.3. The tree $T_v$ of Corollary 9.2.1 and two copies of $T$ from which $T_v$
has been removed and replaced by 0 in one case (the tree $T_0$) and 1 in the other (the tree
$T_1$) are formed and the value of $T_v$ is used to determine which of $T_0$ and $T_1$ is the value of $T$.
Since $T_v$ has at least $\lceil L_0/(r + 1) \rceil$ and at most $\lfloor rL_0/(r + 1) \rfloor \leq L_0 - 1$ leaves, each of $T_0$
and $T_1$ has at most $L_0 - \lceil L_0/(r + 1) \rceil = \lfloor rL_0/(r + 1) \rfloor$ leaves. (See Problem 9.1.) Thus,
all trees have at most $\lfloor rL_0/(r + 1) \rfloor \leq L_0 - 1$ leaves and the inductive hypothesis applies.
Since the depth of the new circuit is the depth of $f_{\mathrm{mux}}$ plus the maximum of the depths of
the three trees, $f$ has the following depth bound:

$$D_\Omega(f) \leq D_\Omega(f_{\mathrm{mux}}) + d(\Omega) \log_r\left(\frac{rL_0}{r + 1}\right)$$

The desired result follows from the definition of $d(\Omega)$. ■
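The last step can be written out as follows. This is a sketch that assumes $d(\Omega) = (D_\Omega(f_{\mathrm{mux}}) + 1)/\log_r((r + 1)/r)$, the form of the constant used in the base case of the induction, and that each of the three trees has at most $rL_0/(r + 1)$ leaves.

```latex
\begin{align*}
D_\Omega(f)
  &\le D_\Omega(f_{\mathrm{mux}}) + d(\Omega)\log_r\frac{rL_0}{r+1} \\
  &= d(\Omega)\log_r L_0 + D_\Omega(f_{\mathrm{mux}}) - d(\Omega)\log_r\frac{r+1}{r} \\
  &= d(\Omega)\log_r L_0 + D_\Omega(f_{\mathrm{mux}}) - \bigl(D_\Omega(f_{\mathrm{mux}}) + 1\bigr) \\
  &\le d(\Omega)\log_r L_0 .
\end{align*}
```

The extra depth contributed by the multiplexer is thus absorbed by the reduction in the logarithm that comes from shrinking every tree by a factor of at least $(r + 1)/r$.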


Figure 9.3 Decomposition of a tree circuit $T$ for the purpose of reducing its depth. A large
subtree $T_v$ is removed and its value used to select the value computed by two trees formed from
the original tree by replacing the value of $T_v$ alternately by 0 and 1.

Combining this result with Lemma 9.2.3, we obtain a relationship between the formula
sizes  of  a  function  over  two  different  complete  bases. 

THEOREM 9.2.3 Let $\Omega_a$ and $\Omega_b$ be two complete bases with fan-in $r_a$ and $r_b$, respectively. There
is a constant $\alpha$ such that the formula size of a function $f : \mathcal{B}^n \mapsto \mathcal{B}$ with respect to these bases
satisfies the following relationship:

$$L_{\Omega_a}(f) \leq \left[L_{\Omega_b}(f)\right]^{\alpha}$$

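Proof Let $D_{\Omega_a}(f)$ and $D_{\Omega_b}(f)$ be the depth of $f$ over the bases $\Omega_a$ and $\Omega_b$, respectively.
From Theorem 9.2.2, $\log_{r_a} L_{\Omega_a}(f) \leq D_{\Omega_a}(f)$ and $D_{\Omega_b}(f) \leq d(\Omega_b) \log_{r_b} L_{\Omega_b}(f)$.

From Lemma 9.2.3 we know there is a constant $d_{a,b}$ such that if a function $f : \mathcal{B}^n \mapsto \mathcal{B}$
has depth $D_{\Omega_b}(f)$ over the basis $\Omega_b$, then it has depth $D_{\Omega_a}(f)$ over the basis $\Omega_a$, where

$$D_{\Omega_a}(f) \leq d_{a,b}\, D_{\Omega_b}(f)$$

The constant $d_{a,b}$ is the depth of the largest-depth basis element of $\Omega_b$ when realized by a
circuit over $\Omega_a$.

Combining these facts, we have that

$$\begin{aligned}
L_{\Omega_a}(f) &\leq (r_a)^{D_{\Omega_a}(f)} \leq (r_a)^{d_{a,b} D_{\Omega_b}(f)} \\
&\leq (r_a)^{d_{a,b}\, d(\Omega_b) \log_{r_b} L_{\Omega_b}(f)} \\
&= L_{\Omega_b}(f)^{d_{a,b}\, d(\Omega_b) (\log_{r_b} r_a)}
\end{aligned}$$

Here we have used the identity $x^{\log_y z} = z^{\log_y x}$. ■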
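The identity used in the last step follows by writing both sides as powers of $y$, as the short derivation below shows for positive $x$, $z$ and $y > 1$.

```latex
\begin{align*}
x^{\log_y z}
  = \bigl(y^{\log_y x}\bigr)^{\log_y z}
  = y^{(\log_y x)(\log_y z)}
  = \bigl(y^{\log_y z}\bigr)^{\log_y x}
  = z^{\log_y x}.
\end{align*}
```

Applied with $x = r_a$, $y = r_b$, and $z = L_{\Omega_b}(f)$ raised to the constant $d_{a,b}\, d(\Omega_b)$, it turns the exponential in $\log_{r_b} L_{\Omega_b}(f)$ into a fixed power of $L_{\Omega_b}(f)$.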

This  result  can  be  extended  to  the  monotone  basis.  (See  Problem  9.14.)  We  now  derive  a 
relationship  between  circuit  size  and  depth. 


9.3  Lower-Bound  Methods  for  General  Circuits 

In Chapter 2 upper bounds were derived for a variety of functions, including logical, arithmetic,
shifting, and symmetric functions as well as encoder, decoder, multiplexer, and demultiplexer
functions. We also established lower bounds on size and depth of the most complex
Boolean  functions  on  n  variables.  In  this  section  we  present  techniques  for  deriving  lower 
bounds  on  circuit  size  and  depth  for  particular  functions  when  realized  by  general  logic  circuits. 

9.3.1  Simple  Lower  Bounds 

A function $f : \mathcal{B}^n \mapsto \mathcal{B}$ on $n$ variables is dependent on its $i$th variable, $x_i$, if there exist
values $c_1, c_2, \ldots, c_{i-1}, c_{i+1}, \ldots, c_n$ such that

$$f(c_1, c_2, \ldots, c_{i-1}, 0, c_{i+1}, \ldots, c_n) \neq f(c_1, c_2, \ldots, c_{i-1}, 1, c_{i+1}, \ldots, c_n)$$

This  simple  property  leads  to  lower  bounds  on  circuit  size  and  depth  that  result  from  the 
connectivity  that  a  circuit  must  have  to  compute  a  function  depending  on  each  of  its  variables. 
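Dependence on a variable can be tested by brute force for small $n$: fix the other variables in every possible way and see whether flipping $x_i$ ever changes the output. The sketch below uses 0-based variable indices and treats functions as Python callables on 0/1 tuples; the names are illustrative.

```python
from itertools import product

def depends_on(f, n, i):
    """True if some assignment to the other n-1 variables makes f change
    when variable i (0-based) is switched from 0 to 1."""
    for c in product((0, 1), repeat=n - 1):
        x0 = c[:i] + (0,) + c[i:]
        x1 = c[:i] + (1,) + c[i:]
        if f(x0) != f(x1):
            return True
    return False

def and3(x):
    """AND of three inputs: depends on every variable."""
    return x[0] & x[1] & x[2]

def or_first_two(x):
    """OR of the first two inputs: ignores the third."""
    return x[0] | x[1]

print([depends_on(and3, 3, i) for i in range(3)])
print([depends_on(or_first_two, 3, i) for i in range(3)])
```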
THEOREM 9.3.1 Let $f : \mathcal{B}^n \mapsto \mathcal{B}$ be dependent on each of its $n$ variables. Then over each basis
$\Omega$ of fan-in $r$, the size and depth of $f$ satisfy the following lower bounds:

$$C_\Omega(f) \geq \left\lceil \frac{n - 1}{r - 1} \right\rceil$$

$$D_\Omega(f) \geq \lceil \log_r n \rceil$$

Proof Consider a circuit of size $C_\Omega(f)$ for $f$. Since it has fan-in $r$, it has at most $rC_\Omega(f)$
edges between gates. After we show that this circuit also has at least $C_\Omega(f) + n - 1$ edges,
we observe that $rC_\Omega(f) \geq C_\Omega(f) + n - 1$, from which the conclusion follows.

Since $f$ depends on each of its $n$ variables, there must be at least one edge attached to
each of them. Similarly, because the circuit has minimal size there must be at least one edge
attached to each of the $C_\Omega(f)$ gates except possibly for the output gate. Thus, the circuit
has at least $C_\Omega(f) + n - 1$ edges and the conclusion follows.

The depth lower bound uses the fact that a circuit with depth $d$ and fan-in $r$ with the
largest number of inputs is a tree. Such trees have at most $r^d$ leaves (input vertices). Because
$f$ depends on each of its variables, a circuit for $f$ of depth $d$ has at least $n$ and at most $r^d$
leaves, from which the depth lower bound follows. ■

This lower bound is the best possible given the information used to derive it. To see this,
observe that the function $f(x_1, x_2, \ldots, x_n) = x_1 \wedge x_2 \wedge \cdots \wedge x_n$, which depends on each of
its variables, has circuit size $\lceil (n - 1)/(r - 1) \rceil$ and depth $\lceil \log_r n \rceil$ over the basis containing
the $r$-input AND gate. (See Problem 9.15.)
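Both quantities can be reproduced by simple counting. The sketch below is an illustration under stated assumptions rather than a solution to the problem cited: `and_gate_count` greedily merges up to $r$ signals per AND gate and counts gates, while `balanced_and_depth` groups signals level by level and counts levels. The two functions describe two separate constructions, and AND gates with fewer than $r$ inputs are assumed to be available (for instance by repeating an input).

```python
def and_gate_count(n, r):
    """Gates used when each AND gate merges up to r of the remaining signals."""
    signals, gates = n, 0
    while signals > 1:
        taken = min(r, signals)
        signals -= taken - 1      # 'taken' signals are replaced by one output
        gates += 1
    return gates

def balanced_and_depth(n, r):
    """Levels needed when signals are grouped r at a time, level by level."""
    signals, depth = n, 0
    while signals > 1:
        signals = -(-signals // r)    # ceil(signals / r)
        depth += 1
    return depth

for n, r in [(8, 2), (8, 3), (10, 3), (100, 4)]:
    print(n, r, and_gate_count(n, r), balanced_and_depth(n, r))
```

Each full gate reduces the number of live signals by $r - 1$, which is where the $(n - 1)/(r - 1)$ term comes from, and each level divides the number of signals by $r$, which gives the logarithm.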

9.3.2  The  Gate-Elimination  Method  for  Circuit  Size 

The  search  for  methods  to  derive  large  lower  bounds  on  circuit  size  for  functions  over  complete 
bases  has  to  date  been  largely  unsuccessful.  The  largest  lower  bounds  on  circuit  size  that  have 
been  derived  for  explicitly  defined  functions  are  linear  in  n,  the  number  of  variables  on  which 
the  functions  depend.  Since  most  Boolean  functions  on  n  variables  have  exponential  size  (see 
Theorem  2.12.1),  functions  do  exist  that  have  high  complexity.  Unfortunately,  this  fact  doesn’t 
help us to show that any particular problem has high circuit size. In particular, it does not help
us to show that $\mathrm{P} \neq \mathrm{NP}$.

In  this  section  we  introduce  the  gate-elimination  method  for  deriving  linear  lower  bounds. 
When  applied  with  care,  it  provides  the  strongest  known  lower  bounds  for  complete  bases. 
The gate-elimination method uses induction on the properties of a function $f$ on $n$ variables
to show two things: a) a few variables of $f$ can be assigned values so that the resulting function
is of the same type as $f$, and b) a few gates in any circuit for $f$ can be eliminated by this
assignment  of  values.  After  eliminating  all  variables  by  assigning  values  to  them,  the  function 
is  constant.  Since  the  number  of  gates  in  the  original  circuit  cannot  be  smaller  than  the  number 
removed  during  this  process,  the  original  circuit  has  at  least  as  many  gates  as  were  removed. 

We  now  apply  the  gate-elimination  method  to  functions  in  the  class  Q ^  defined  below. 
Functions  in  this  class  have  at  least  three  different  subfunctions  when  any  pair  of  variables 
ranges  through  all  four  possible  assignments. 

DEFINITION 9.3.1 A Boolean function $f : \mathcal{B}^n \mapsto \mathcal{B}$ belongs to the class $Q_{2,3}^{(n)}$ if for any two variables $x_i$ and $x_j$, $f$ has at least three distinct subfunctions as $x_i$ and $x_j$ range over all possible values. Furthermore, for each variable $x_i$ there is a value $c_i$ such that the subfunction of $f$ obtained by assigning $x_i$ the value $c_i$ is in $Q_{2,3}^{(n-1)}$.

The class $Q_{2,3}^{(n)}$ contains the function $f_{\mathrm{mod}\,3,c}^{(n)} : \mathcal{B}^n \mapsto \mathcal{B}$, as we show. Here $z \bmod a$ is the remainder of $z$ after removing all multiples of $a$.


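LEMMA 9.3.1 For $n \ge 3$ and $c \in \{0,1,2\}$, the function $f_{\mathrm{mod}\,3,c}^{(n)} : \mathcal{B}^n \mapsto \mathcal{B}$ defined below is in $Q_{2,3}^{(n)}$:

$f_{\mathrm{mod}\,3,c}^{(n)}(x_1, x_2, \ldots, x_n) = ((y + c) \bmod 3) \bmod 2$

where $y = \sum_{i=1}^{n} x_i$ and $\sum$ and $+$ denote integer addition.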


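Proof We show that the functions $f_{\mathrm{mod}\,3,c}^{(n)}$, $c \in \{0,1,2\}$, are all distinct when $n \ge 1$. When $n = 1$, the functions are different because $f_{\mathrm{mod}\,3,0}^{(1)}(x_1) = x_1$, $f_{\mathrm{mod}\,3,1}^{(1)}(x_1) = \overline{x}_1$, and $f_{\mathrm{mod}\,3,2}^{(1)}(x_1) = 0$. For $n = 2$, $y$ can assume values in $\{0,1,2\}$. Because the functions $f_{\mathrm{mod}\,3,0}^{(2)}(x_1,x_2)$, $f_{\mathrm{mod}\,3,1}^{(2)}(x_1,x_2)$, and $f_{\mathrm{mod}\,3,2}^{(2)}(x_1,x_2)$ have value 1 only when $y = x_1 + x_2 = 1, 0, 2$, respectively, the three functions are different. The same argument applies for larger $n$, since $y$ can still assume the values 0, 1, and 2.

The proof of membership of $f_{\mathrm{mod}\,3,c}^{(n)}$ in $Q_{2,3}^{(n)}$ is by induction. The base case is $n = 3$, which holds, as shown in the next paragraph. The inductive hypothesis is that for each $c \in \{0,1,2\}$, $f_{\mathrm{mod}\,3,c}^{(n-1)} \in Q_{2,3}^{(n-1)}$.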


To show that for $n \ge 3$, $f_{\mathrm{mod}\,3,c}^{(n)}$ has at least three distinct subfunctions as any two of its variables range over all values, let $y^*$ be the sum of the $n - 2$ variables that are not fixed and let $c^*$ be the sum of $c$ and the values of the two variables that are fixed. Then the value of the function is $((y^* + c^*) \bmod 3) \bmod 2 = (((y^* \bmod 3) + (c^* \bmod 3)) \bmod 3) \bmod 2$. Since $(y^* \bmod 3)$ and $(c^* \bmod 3)$ range over the values 0, 1, and 2, the three functions are different, as shown in the first paragraph of this proof.

To show that for any variable $x_i$ there is an assignment $c_i$ such that the resulting subfunction $f_{\mathrm{mod}\,3,c}^{(n-1)}$ is in $Q_{2,3}^{(n-1)}$, let $c_i = 0$. ■
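The first condition of Definition 9.3.1 can be checked by brute force for small $n$. The following is a minimal sketch, not part of the original proof; the helper names `f_mod3`, `distinct_subfunctions`, and `has_three_subfunctions` are hypothetical, and variables are indexed from 0 for convenience.

```python
from itertools import combinations, product


def f_mod3(c, x):
    """((sum(x) + c) mod 3) mod 2 for a tuple x of 0/1 values."""
    return ((sum(x) + c) % 3) % 2


def distinct_subfunctions(c, n, i, j):
    """Count the distinct subfunctions of f_mod3(c, .) on n variables
    obtained as x_i and x_j (0-based indices) range over all four values."""
    free = [p for p in range(n) if p not in (i, j)]
    tables = set()
    for vi, vj in product((0, 1), repeat=2):
        table = []
        for rest in product((0, 1), repeat=len(free)):
            x = [0] * n
            x[i], x[j] = vi, vj
            for p, v in zip(free, rest):
                x[p] = v
            table.append(f_mod3(c, tuple(x)))
        # Each truth table over the free variables identifies one subfunction.
        tables.add(tuple(table))
    return len(tables)


def has_three_subfunctions(c, n):
    """Check the three-subfunction condition for every pair of variables."""
    return all(distinct_subfunctions(c, n, i, j) >= 3
               for i, j in combinations(range(n), 2))
```

Only the pairwise condition is tested here; the recursive requirement that some restriction lies in $Q_{2,3}^{(n-1)}$ is handled by the induction in the proof.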


We now derive a lower bound on the circuit size of functions in the class $Q_{2,3}^{(n)}$.


THEOREM 9.3.2 Over the basis of all Boolean functions on two inputs, $\Omega$, if $f \in Q_{2,3}^{(n)}$ for $n \ge 3$, then

$C_\Omega(f) \ge 2n - 3$


Proof We show that $f$ depends on each of its variables. Suppose it does not depend on $x_i$. Then, pick $x_i$ and a second variable $x_j$ and let them range over all four possible values. Since the value of $x_i$ has no effect on $f$, $f$ has at most two subfunctions as $x_i$ and $x_j$ range over all values, contradicting its definition.

We now show that some input vertex $x_i$ of a circuit for $f$ has fan-out of 2 or more. Consider a gate $g$ in a circuit for $f$ whose longest path to the output gate is maximal among all gates. (See Fig. 9.4.) Since the circuit does not have loops and no other vertex is farther away from the output, both of $g$'s input edges must be attached to input vertices. Let $x_i$ and $x_j$ be the two inputs to this gate. If the fan-out of both of these input vertices is 1, they influence the value of $f$ only through the one gate to which they are connected. Since this gate has at most two values for the four assignments to inputs, $f$ has at most two subfunctions, contradicting the definition of $f$.

Figure 9.4 A circuit in which gate $g_4$ has maximal distance from the output gate. The input $x_2$ has fan-out 2.

If $n = 3$, this fact demonstrates that the fan-out from the three inputs has to be at least 4, that is, the circuit has at least four inputs. From Theorem 9.3.1 it follows that $C_\Omega(f) \ge 2n - 3$ for $n = 3$. This is the base case for a proof by induction.

The inductive hypothesis is that for any $f^* \in Q_{2,3}^{(n-1)}$, $C_\Omega(f^*) \ge 2(n-1) - 3$. From the earlier argument it follows that there is an input vertex $x_i$ in a circuit for $f \in Q_{2,3}^{(n)}$ that has fan-out 2. Let $x_i$ have that value that causes the subfunction $f^*$ of $f$ to be in $Q_{2,3}^{(n-1)}$. Fixing $x_i$ eliminates at least two gates in the circuit for $f$ because each gate connected to $x_i$ either has a constant output, computes the identity, or computes the NOT of its input. The negation, if any, can be absorbed by the gate that precedes or follows it. Thus,

$C_\Omega(f) \ge C_\Omega(f^*) + 2 \ge 2(n-1) - 3 + 2 = 2n - 3$

which establishes the result. ■


As a consequence of this theorem, the function $f_{\mathrm{mod}\,3,c}^{(n)}$ requires at least $2n - 3$ gates over the basis $B_2$. It can also be shown to require at most $3n + O(1)$ gates [86].

We now derive a second lower-bound result using the gate-elimination method. In this case we demonstrate that the upper bound on the complexity of the multiplexer function $f_{\mathrm{mux}}^{(n)} : \mathcal{B}^{2^n+n} \mapsto \mathcal{B}$ introduced in Section 2.5.5, which is $2^{n+1} + O(n\sqrt{2^n})$, is optimal to within an additive term of size $O(n\sqrt{2^n})$. (The multiplexer function is also called the storage access function.) We generalize the storage access function $f_{\mathrm{SA}}^{(k,n)} : \mathcal{B}^{n+k} \mapsto \mathcal{B}$ slightly and write it in terms of a $k$-bit address $a$ and an $n$-tuple $x$, as shown below, where $|a|$ denotes the integer represented by the binary number $a$ and $2^k \ge n$:

$f_{\mathrm{SA}}^{(k,n)}(a_{k-1}, \ldots, a_1, a_0, x_{n-1}, \ldots, x_0) = x_{|a|}$

Thus $f_{\mathrm{mux}}^{(n)} = f_{\mathrm{SA}}^{(n,2^n)}$.
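The storage access function is easy to state operationally. The sketch below is only illustrative: the names `address_value`, `f_sa`, and `f_mux` are hypothetical, tuples are written most significant element first to match the argument order in the text, and raising an error for addresses with $|a| \ge n$ is an assumption, since the text does not say what happens there.

```python
def address_value(a):
    """Integer |a| for address bits a = (a_{k-1}, ..., a_0)."""
    value = 0
    for bit in a:
        value = 2 * value + bit
    return value


def f_sa(a, x):
    """Storage access function: x = (x_{n-1}, ..., x_0); returns x_{|a|}.

    Assumption: addresses with |a| >= n are rejected, because the text
    leaves the value on such inputs unspecified.
    """
    n = len(x)
    idx = address_value(a)
    if idx >= n:
        raise ValueError("address out of range")
    return x[n - 1 - idx]


def f_mux(a, x):
    """Multiplexer: the special case in which x has exactly 2^k entries."""
    if len(x) != 2 ** len(a):
        raise ValueError("multiplexer needs 2^k data inputs")
    return f_sa(a, x)
```

For example, with $a = (1, 0)$ and $x = (x_3, x_2, x_1, x_0) = (0, 1, 0, 0)$ the address is $|a| = 2$ and the selected input is $x_2 = 1$.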

To derive a lower bound on the circuit size of $f_{\mathrm{SA}}^{(k,n)}$ we introduce the class $F_s^{(n,k)}$ of Boolean functions on $n + k$ variables defined below.

DEFINITION 9.3.2 A Boolean function $f : \mathcal{B}^{n+k} \mapsto \mathcal{B}$ belongs to the class $F_s^{(n,k)}$, $2^k \ge n$, if for some set $S \subseteq \{0, 1, \ldots, n-1\}$, $|S| = s$,

$f(a_{k-1}, \ldots, a_1, a_0, x_{n-1}, \ldots, x_0) = x_{|a|}$

for $|a| \in S$.

Clearly, $f_{\mathrm{SA}}^{(k,n)}$ is a member of $F_n^{(n,k)}$. We now show that every function in $F_s^{(n,k)}$ has circuit size that is at least $2s - 2$.

In the proof of Theorem 9.3.2 the gate-elimination method replaced variables with constants. In the following proof this idea is extended to replacing variables by functions. Applying the resulting bound with $s = 2^n$, we have that $C_\Omega(f_{\mathrm{mux}}^{(n)}) \ge 2^{n+1} - 2$.

THEOREM 9.3.3 Let $f : \mathcal{B}^{n+k} \mapsto \mathcal{B}$ belong to $F_s^{(n,k)}$, $2^k \ge n$. Then over the basis $B_2$ the circuit size of $f$ satisfies the following bound:

$C_\Omega(f) \ge 2s - 2$

Proof In the proof of Theorem 9.3.2 we used the fact that some input variable has fan-out 2 or more, as deduced from a property of functions in $Q_{2,3}^{(n)}$. This fact does not hold for the storage access function (multiplexer), as can be seen from the construction in Section 2.5.5. Thus, our lower-bound argument must explicitly take into account the fact that the fan-out from some input can be 1.

The following proof uses the fact that the basis $B_2$ contains functions of two kinds, AND-type and parity-type functions. The former compute expressions of the form $(x^a \wedge y^b)^c$ for Boolean constants $a$, $b$, $c$, where the notation $x^c$ denotes $x$ when $c = 1$ and $\overline{x}$ when $c = 0$. Parity-type functions compute expressions of the form $x \oplus y \oplus c$ for some Boolean constant $c$. (See Problem 9.19.)
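The split of two-input functions into AND-type and parity-type can be confirmed by enumerating all 16 truth tables. This is a small sketch under the notation just introduced; the names `lit`, `truth_table`, `and_type`, `parity_type`, and `degenerate` are hypothetical.

```python
from itertools import product


def truth_table(g):
    """Values of g on (x, y) = (0,0), (0,1), (1,0), (1,1)."""
    return tuple(g(x, y) for x, y in product((0, 1), repeat=2))


def lit(x, c):
    """x^c in the text's notation: x when c = 1, NOT x when c = 0."""
    return x if c == 1 else 1 - x


# AND-type: (x^a AND y^b)^c for all Boolean constants a, b, c.
and_type = {
    truth_table(lambda x, y, a=a, b=b, c=c: lit(lit(x, a) & lit(y, b), c))
    for a, b, c in product((0, 1), repeat=3)
}

# Parity-type: x XOR y XOR c for both constants c.
parity_type = {truth_table(lambda x, y, c=c: x ^ y ^ c) for c in (0, 1)}

# Everything else fails to depend on both inputs.
all_tables = set(product((0, 1), repeat=4))
degenerate = all_tables - and_type - parity_type


def depends_on_both(table):
    """True when the function with this truth table depends on x and on y."""
    on_x = table[0] != table[2] or table[1] != table[3]
    on_y = table[0] != table[1] or table[2] != table[3]
    return on_x and on_y
```

The enumeration yields eight AND-type and two parity-type functions; the remaining six are the two constants and the four functions of a single input.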

The proof is by induction on the value of $s$. In the base case $s = 1$ and the lower bound is trivially 0. The inductive hypothesis assumes that for $s = s' - 1$, $C_\Omega(f) \ge 2(s' - 1) - 2$. We let $s = s'$ and consider the following mutually exclusive cases:

a) For some $i \in S$, $x_i$ has fan-out 2. Replacing $x_i$ by a constant allows elimination of at least two gates, replaces $S$ by $S - \{i\}$, which has size $s' - 1$, and reduces $f$ to $f^* \in F_{s'-1}^{(n,k)}$, from which we conclude that

$C_\Omega(f) \ge 2 + C_\Omega(f^*) \ge 2 + 2(s' - 1) - 2 = 2s' - 2 = 2s - 2$

b) For some $i \in S$, $x_i$ has fan-out 1, its unique successor is a gate $G$ of AND-type, and $G$ computes the expression $(x_i^a \wedge g^b)^c$ for some function $g$ of the inputs. Setting $x_i = \overline{a}$ sets $x_i^a = \overline{a}^a = 0$, thereby causing the expression to have value $0^c$, which is a constant. Since $G$ cannot be the output gate, this substitution allows the elimination of $G$ and at least one successor gate, reduces $f$ to $f^* \in F_{s'-1}^{(n,k)}$, and replaces $S$ by $S - \{i\}$, from which the lower bound follows.

c) For some $i \in S$, $x_i$ has fan-out 1, its unique successor is a gate $G$ of parity-type, and $G$ computes the expression $x_i \oplus g \oplus c$ for some function $g$ of the inputs. Replace $S$ by $S - \{i\}$. Since we ask that the output of the circuit be $x_{|a|}$ for $|a| \in S - \{i\}$, this output cannot depend on the value of $G$ because a change in $x_i$ would cause the value of $G$ to change. Thus, $G$ is not the output gate and when $|a| \in S - \{i\}$ we can set its value to any function without affecting the value computed by the circuit. In particular, setting $x_i = g$ causes $G$ to have value $c$, a constant. This substitution allows the elimination of $G$ and at least one successor gate, and reduces $f$ to $f^* \in F_{s'-1}^{(n,k)}$, from which the lower bound follows.

Thus, in all cases, $C_\Omega(f) \ge 2s' - 2$. ■

The lower bounds given above are derived for two functions over the basis $B_2$. The best circuit-size lower bound that has been derived for this basis is $3(n - 1)$. When the basis is restricted, larger lower bounds may result, as mentioned in the notes and illustrated by Problems 9.22 and 9.23.


9.4  Lower-Bound  Methods  for  Formula  Size 

Since formulas correspond to circuits of fan-out 1, the formula size of a function may be much larger than its circuit size. In this section we introduce two techniques for deriving lower bounds on formula size that illustrate this point. Each leads to bounds that are quadratic or nearly quadratic in the number of inputs. The first, due to Neciporuk [230], applies to any complete basis. The second, due to Krapchenko [174], applies to the standard basis $\Omega_0$.

To fix ideas about formula size, we construct a circuit of fan-out 1 for the indirect storage access function $f_{\mathrm{ISA}}^{(k,l)} : \mathcal{B}^{k+lK+L} \mapsto \mathcal{B}$, where $K = 2^k$ and $L = 2^l$:

$f_{\mathrm{ISA}}^{(k,l)}(a, x_{K-1}, \ldots, x_0, y) = y_{|x_{|a|}|}$

Here $a$ is a $k$-tuple, $x_j = (x_{j,l-1}, \ldots, x_{j,0})$ is an $l$-tuple for $0 \le j \le K - 1$, and $y = (y_{L-1}, \ldots, y_0)$ is an $L$-tuple. The value of $f_{\mathrm{ISA}}^{(k,l)}$ is computed by indirection; that is, the value of $a$ is treated as a binary number with value $|a|$ that is used to select the $|a|$th $l$-tuple $x_{|a|}$; this, in turn, is treated as a binary number and its value is used to select the $|x_{|a|}|$th variable in $y$.
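The two levels of indirection can be made concrete with a direct evaluator. This is an illustrative sketch only; the names `to_int` and `f_isa` are hypothetical, and all tuples are written most significant element first to match the argument order used in the definition.

```python
def to_int(bits):
    """Integer value of a binary tuple written most significant bit first."""
    value = 0
    for bit in bits:
        value = 2 * value + bit
    return value


def f_isa(a, xs, y):
    """Indirect storage access function.

    a  : k-tuple of address bits
    xs : list of K = 2^k l-tuples, given as (x_{K-1}, ..., x_0)
    y  : L-tuple (y_{L-1}, ..., y_0) with L = 2^l
    """
    K, L = len(xs), len(y)
    x_selected = xs[K - 1 - to_int(a)]      # the |a|th l-tuple
    return y[L - 1 - to_int(x_selected)]    # the |x_{|a|}|th variable of y
```

With $k = l = 1$, $a = (1)$, $x_1 = (1)$, $x_0 = (0)$, and $y = (y_1, y_0) = (1, 0)$, the address selects $x_1$, whose value 1 in turn selects $y_1 = 1$.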

A circuit realizing $f_{\mathrm{ISA}}^{(k,l)}$ from multiple copies of the multiplexer (direct storage access function) $f_{\mathrm{mux}}^{(n)} : \mathcal{B}^{2^n+n} \mapsto \mathcal{B}$ is shown schematically in Fig. 9.5. This circuit uses $l$ copies of $f_{\mathrm{mux}}^{(k)} : \mathcal{B}^{2^k+k} \mapsto \mathcal{B}$ and one copy of $f_{\mathrm{mux}}^{(l)} : \mathcal{B}^{2^l+l} \mapsto \mathcal{B}$. The copies of $f_{\mathrm{mux}}^{(k)}$ produce the $|a|$th $l$-tuple, which is supplied to the copy of $f_{\mathrm{mux}}^{(l)}$ to select a variable from $y$. Since, as shown in Lemma 2.5.5, the function $f_{\mathrm{mux}}^{(k)}$ can be realized by a circuit of size linear in $2^k$, a circuit for $f_{\mathrm{ISA}}^{(k,l)}$ can be constructed that is also linear in the size of its input.

A formula for $f_{\mathrm{ISA}}^{(k,l)}$ has fan-out of 1 from every gate. The circuit sketched in Fig. 9.5 has fan-out 1 if and only if the fan-out within each multiplexer circuit is also 1. To construct a formula from this circuit, we first construct one for $f_{\mathrm{mux}}^{(p)}$. The total number of times that address bits appear in a formula for $f_{\mathrm{mux}}^{(l)}$ determines the number of copies of the formula for $f_{\mathrm{mux}}^{(k)}$ that are used in the formula for $f_{\mathrm{ISA}}^{(k,l)}$. A proof by induction can be developed to show that a formula for $f_{\mathrm{mux}}^{(p)}$ can be constructed of size $3 \cdot 2^p - 2$ in which address bits occur $2(2^p - 1)$ times. (See Problem 9.24.) Since each occurrence of an address bit in $f_{\mathrm{mux}}^{(l)}$ corresponds to a copy of the formula for $f_{\mathrm{mux}}^{(k)}$, by choosing $L = 2^l = n$ and $k$ the smallest integer such that $K = 2^k \ge n/l$ we see that $f_{\mathrm{ISA}}^{(k,l)}$ has $2^l + l2^k + k = O(n)$ variables and that its formula size is $2(2^l - 1)L_\Omega(f_{\mathrm{mux}}^{(k)}) + L_\Omega(f_{\mathrm{mux}}^{(l)})$, which is $O(n^2/\log_2 n)$, as summarized in Lemma 9.4.1.

Figure 9.5 The schema used to construct a circuit of fan-out 1 for the indirect storage access function $f_{\mathrm{ISA}}^{(k,l)}$.
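The counts quoted for the multiplexer formula follow from a simple recursion. The sketch below assumes one natural construction, which may differ from the one intended in Problem 9.24: a multiplexer on $p$ address bits is written as $(\overline{a} \wedge F_{\mathrm{low}}) \vee (a \wedge F_{\mathrm{high}})$, where $a$ is the top address bit and $F_{\mathrm{low}}$, $F_{\mathrm{high}}$ are formulas on $p - 1$ address bits. The function names are hypothetical.

```python
def mux_formula(p):
    """Return (leaves, address_occurrences) for the recursive formula
    (NOT a AND F_low) OR (a AND F_high) on p address bits."""
    if p == 0:
        return 1, 0                      # a single data variable
    leaves, addr = mux_formula(p - 1)
    # Two copies of the smaller formula plus two occurrences of the new bit.
    return 2 * leaves + 2, 2 * addr + 2


def isa_formula_leaves(k, l):
    """Leaves of the formula obtained by replacing every address-bit
    occurrence in the l-bit multiplexer by a k-bit multiplexer formula;
    the 2^l data leaves of the outer multiplexer are the variables of y."""
    leaves_k, _ = mux_formula(k)
    _, addr_l = mux_formula(l)
    return addr_l * leaves_k + 2 ** l
```

The recursion reproduces the closed forms $3 \cdot 2^p - 2$ and $2(2^p - 1)$, and the substituted formula has $2(2^l - 1)(3 \cdot 2^k - 2) + 2^l$ leaves.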

LEMMA 9.4.1 Let $2^l = n$ and $k = \lceil \log_2(n/l) \rceil$. Then the formula size of $f_{\mathrm{ISA}}^{(k,l)} : \mathcal{B}^{k+lK+L} \mapsto \mathcal{B}$ satisfies the following bound:

$L_\Omega(f_{\mathrm{ISA}}^{(k,l)}) = O(n^2/\log_2 n)$

We now introduce Neciporuk's method, by which it can be shown that this bound for $f_{\mathrm{ISA}}^{(k,l)}$ is optimal to within a constant multiplicative factor.

9.4.1 The Neciporuk Lower Bound

The Neciporuk lower-bound method uses a partition of the variables $X = \{x_1, x_2, \ldots, x_n\}$ of a Boolean function $f^{(n)} : \mathcal{B}^n \mapsto \mathcal{B}$ into disjoint sets $X_1, X_2, \ldots, X_p$. That is, $X = \bigcup_{i=1}^{p} X_i$ and $X_i \cap X_j = \emptyset$ for $i \ne j$. The lower bound on the formula size of $f$ is stated in terms of $r_{X_j}(f)$, $1 \le j \le p$, the number of subfunctions of $f$ when restricted to variables in $X_j$. That is, $r_{X_j}(f)$ is the number of different subfunctions of $f$ in the variables in $X_j$ obtained by ranging over all values for variables in $X - X_j$.
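For small functions the quantity $r_{X_j}(f)$ can be computed by exhaustive enumeration. The following sketch is illustrative only; `count_subfunctions` and the five-variable example `tiny_isa` (an indirect storage access function with $k = l = 1$) are hypothetical names, and variables are indexed from 0.

```python
from itertools import product


def count_subfunctions(f, n, block):
    """r_{X_j}(f): the number of distinct functions of the variables indexed
    by `block` obtained by fixing the remaining variables in every way."""
    block = list(block)
    rest = [p for p in range(n) if p not in block]
    tables = set()
    for fixed in product((0, 1), repeat=len(rest)):
        x = [0] * n
        for p, v in zip(rest, fixed):
            x[p] = v
        table = []
        for free in product((0, 1), repeat=len(block)):
            for p, v in zip(block, free):
                x[p] = v
            table.append(f(tuple(x)))
        # One truth table over the block per assignment to the other variables.
        tables.add(tuple(table))
    return len(tables)


def tiny_isa(x):
    """Hypothetical instance with k = l = 1: x = (a, x1, x0, y1, y0)."""
    a, x1, x0, y1, y0 = x
    selected = x1 if a == 1 else x0
    return y1 if selected == 1 else y0
```

For `tiny_isa`, each block consisting of a single $x_j$ yields all four functions of one variable, while a two-variable block of the four-variable parity function yields only two subfunctions.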

We now describe Neciporuk's lower bound on formula size. We emphasize that the strength of the lower bound depends on which partition $X_1, X_2, \ldots, X_p$ of the variables $X$ is chosen. After the proof we apply it to the indirect storage access function. The method cannot provide a lower bound that is larger than $O(n^2/\log n)$ for a function on $n$ variables. (See Problem 9.25.)

THEOREM 9.4.1 For every complete basis $\Omega$ there is a constant $c_\Omega$ such that for every function $f^{(n)} : \mathcal{B}^n \mapsto \mathcal{B}$ and every partition of its variables $X$ into disjoint sets $X_1, X_2, \ldots, X_p$, the formula size of $f$ with respect to $\Omega$ satisfies the following lower bound:

$L_\Omega(f) \ge c_\Omega \sum_{j=1}^{p} \log_2 r_{X_j}(f)$

Proof Consider $T$, a minimal circuit of fan-out 1 for $f$. Let $n_j$ be the number of instances of variables in $X_j$ that are labels for leaves in $T$. Then by definition $L_\Omega(f) = \sum_{j=1}^{p} n_j$. Let $d$ be the fan-in of the basis $\Omega$.

For  each  j,  1  <  j  <  p,  we  define  the  subtree  Tj  of  T  consisting  of  paths  from  vertices 
with  labels  in  Xj  to  the  output  vertex,  as  suggested  by  the  heavy  lines  in  Fig.  9.6.  We 
observe  that  some  vertices  in  such  a  subtree  have  one  input  from  a  vertex  in  the  subtree  Tj 
(called  controllers  —  shaded  vertices  in  Fig.  9.6)  whereas  others  have  more  than  one  input 
from  a  vertex  in  Tj  (combiners  —  black  vertices  in  Fig.  9.6).  Each  type  of  vertex  typically 
has  inputs  from  vertices  other  than  those  in  Tj,  that  is,  from  vertices  on  paths  from  input 
vertices  in  X  —  Xj . 

When the variables $X - X_j$ are assigned values, the output of a controller or combiner vertex depends only on the inputs it receives from other vertices in $T_j$. The function computed by a controller is a function of its one input $y$ in $T_j$ and can be represented as $(a \wedge y) \oplus b$ for some values of the constants $a$ and $b$. These constants are determined by the values of inputs in $X - X_j$. We assume without loss of generality that each chain of controllers with no intervening combiners is compressed to one controller. The combiner is also some function of its inputs from other vertices in $T_j$. Since the number of such inputs is at least 2, a combiner (with fan-in at most $d$) has at most $d - 2$ inputs determined by variables in $X - X_j$.


Figure 9.6 The subtree $T_j$ of the tree $T$ is identified by heavy edges on paths from input vertices in the set $X_j = \{x_1, x_3\}$. Vertices in $T_j$ that have one heavy input edge are controller vertices. Other vertices in $T_j$ are combiner vertices.

By Lemma 9.2.1, since $T_j$ has $n_j$ leaves, the number of vertices with fan-in of 2 or more (combiners) is at most $n_j - 1$. Also, by Lemma 9.2.1, $T_j$ has at most $2(n_j - 1)$ edges. Since $T_j$ may have one controller at the output and at most one per edge, $T_j$ has at most $2n_j - 1$ controllers.

The function computed by a combiner is one of at most $2^{d-2}$ since at most $d - 2$ of its inputs are determined by variables in $X - X_j$. At most four functions are computed by a controller since there are at most four functions on one variable. It follows that the tree $T_j$ associated with the input variables in $X_j$ containing $n_j$ leaves computes $r_{X_j}(f)$ different functions, where $r_{X_j}(f)$ satisfies the following upper bound. This bound is the product of the number of ways that each of the controllers and combiners can compute functions.

$r_{X_j}(f) \le (2^{d-2})^{n_j - 1} \cdot 4^{2n_j - 1} \le 2^{(d+2)n_j}$

Thus, $(d + 2)n_j \ge \log_2 r_{X_j}(f)$. Since $L_\Omega(f) = \sum_{j=1}^{p} n_j$, the theorem holds for $c_\Omega = 1/(d + 2)$. ■
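The exponent arithmetic behind the counting bound can be written out explicitly. The derivation below is a worked check, assuming (as the proof states) at most $n_j - 1$ combiners with at most $2^{d-2}$ choices each and at most $2n_j - 1$ controllers with at most four choices each.

```latex
\begin{align*}
r_{X_j}(f) &\le \bigl(2^{d-2}\bigr)^{n_j-1} \cdot 4^{\,2n_j-1} \\
           &= 2^{(d-2)(n_j-1) + 2(2n_j-1)} \\
           &= 2^{(d+2)n_j - d} \\
           &\le 2^{(d+2)n_j}.
\end{align*}
```

Taking logarithms gives $\log_2 r_{X_j}(f) \le (d+2)n_j$, and summing over $j$ yields the constant $1/(d+2)$ in the theorem.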


Applying Neciporuk's lower bound to the indirect storage access function yields the following result, which demonstrates that the upper bound given in Lemma 9.4.1 for the indirect storage access function is tight.

LEMMA 9.4.2 Let $2^l = n$ and $k = \lceil \log_2(n/l) \rceil$. The formula size of $f_{\mathrm{ISA}}^{(k,l)} : \mathcal{B}^{k+lK+L} \mapsto \mathcal{B}$ satisfies the following bound:

$L_\Omega(f_{\mathrm{ISA}}^{(k,l)}) = \Omega(n^2/\log n)$

Proof Let $p = K = 2^k$ and let $X_j$ contain $x_j$. If $X_j$ contains other variables, these are assigned fixed values, which cannot increase $r_{X_j}(f)$. For $0 \le j \le K - 1$, set $|a| = j$. $f$ has at least $2^L$ restrictions since for each of the $2^L$ assignments to $(y_{L-1}, \ldots, y_0)$ the restriction of $f$ is distinct; that is, if two different such $L$-tuples are supplied as input, they can be distinguished by some assignment to $x_j$. Thus $r_{X_j}(f) \ge 2^L$. Hence, the formula size of $f_{\mathrm{ISA}}^{(k,l)}$ satisfies $L_\Omega(f_{\mathrm{ISA}}^{(k,l)}) \ge c_\Omega KL$, which is proportional to $n^2/\log n$. ■


9.4.2  The  Krapchenko  Lower  Bound 

Krapchenko's lower bound applies to the standard basis $\Omega_0$ or any complete subset, namely $\{\wedge, \neg\}$ and $\{\vee, \neg\}$. It provides a lower bound on formula size that can be slightly larger than that given by Neciporuk's method.

We apply Krapchenko's method to the parity function $f_\oplus^{(n)} : \mathcal{B}^n \mapsto \mathcal{B}$, where $f_\oplus^{(n)}(x_1, x_2, \ldots, x_n) = x_1 \oplus x_2 \oplus \cdots \oplus x_n$, to show that its formula size is quadratic in $n$. Since the parity function on two variables can be expressed by the formula

$f_\oplus^{(2)}(x_1, x_2) = (x_1 \wedge \overline{x}_2) \vee (\overline{x}_1 \wedge x_2)$

it is straightforward to show that the formula size of $f_\oplus^{(n)}$ is at most quadratic in $n$. (See Problem 9.26.)

DEFINITION 9.4.1 Given two disjoint subsets $A, B \subseteq \{0,1\}^n$ of the set of the Boolean $n$-tuples, the neighborhood of $A$ and $B$, $\mathcal{N}(A, B)$, is the set of pairs of tuples $(x, y)$, $x \in A$ and $y \in B$, such that $x$ and $y$ agree in all but one position.


The neighborhood of $A = \{0\}$ and $B = \{1\}$ is the pair $\mathcal{N}(A, B) = \{(0, 1)\}$. Also, the neighborhood of $A = \{000, 101\}$ and $B = \{111, 010\}$ is the set of pairs $\mathcal{N}(A, B) = \{(000, 010), (101, 111)\}$.

Given a function $f : \mathcal{B}^n \mapsto \mathcal{B}$, we use the notation $f^{-1}(0)$ and $f^{-1}(1)$ to denote the sets of $n$-tuples that cause $f$ to assume the values 0 and 1, respectively.

THEOREM 9.4.2 For any $f : \mathcal{B}^n \mapsto \mathcal{B}$ and any $A \subseteq f^{-1}(0)$ and $B \subseteq f^{-1}(1)$, the following inequality holds over the standard basis $\Omega_0$:

$L_{\Omega_0}(f) \ge \frac{|\mathcal{N}(A, B)|^2}{|A||B|}$


Proof Consider a circuit for $f$ of fan-out 1 over the standard basis that has the minimal number of leaves, namely $L_{\Omega_0}(f)$. Since the fan-in of each gate is either 1 or 2, by Lemma 9.2.1 the number of leaves is one more than the number of gates of fan-in 2. Each fan-in-2 gate is an AND or OR gate with suitable negation on its inputs and outputs.

Consider a minimal formula for $f$. Assume without loss of generality that the formula is written over the basis $\{\wedge, \neg\}$. We prove the lower bound by induction, the base case being that of a function on one variable. If the function is constant, $|\mathcal{N}(A, B)| = 0$ and its formula size is also 0. If the function is non-constant, it is either $x$ or $\overline{x}$. (If $f(x) = x$, $f^{-1}(1) = \{1\}$ and $f^{-1}(0) = \{0\}$.) In both cases, $|\mathcal{N}(A, B)| = 1$ since the neighborhood has only one pair. (In the first case $\mathcal{N}(A, B) = \{(0, 1)\}$.) Also, $|A| = 1$ and $|B| = 1$, thereby establishing the base case.

The inductive hypothesis is that $L_{\Omega_0}(f^*) \ge |\mathcal{N}(A, B)|^2/|A||B|$ for any function $f^*$ whose formula size $L_{\Omega_0}(f^*) \le L_0 - 1$ for some $L_0 \ge 2$. Since the occurrences of NOT do not affect the formula size of a function, apply DeMorgan's theorem as necessary so that the output gate of the optimal (minimal-depth) formula for $f$ is an AND gate. Then we can write $f = g \wedge h$, where $g$ and $h$ are defined on the variables appearing in their formulas. Since the formula for $f$ is optimal, so are the formulas for $g$ and $h$.

Let $A \subseteq f^{-1}(0)$ and $B \subseteq f^{-1}(1)$. Thus, $f(x) = 0$ for $x \in A$ and $f(x) = 1$ for $x \in B$. Since $f = g \wedge h$, if $f(x) = 1$, then both $g(x) = 1$ and $h(x) = 1$. That is, $f^{-1}(1) \subseteq g^{-1}(1)$ and $f^{-1}(1) \subseteq h^{-1}(1)$. (See Fig. 9.7.) It follows that $B \subseteq g^{-1}(1)$ and $B \subseteq h^{-1}(1)$. Let $B_1 = B_2 = B$. Let $A_1 = A \cap g^{-1}(0)$ (which implies $A_1 \subseteq g^{-1}(0)$) and let $A_2 = A - A_1$. Since $f(x) = 0$ for $x \in A$, but $g(x) = 1$ for $x \in A_2$, as suggested in Fig. 9.7, it follows that $A_2 \subseteq h^{-1}(0)$. (Since $f = g \wedge h$, $f(x) = 0$, and $g(x) = 1$, it follows that $h(x) = 0$.) Finally, observe that $\mathcal{N}(A_1, B_1)$ and $\mathcal{N}(A_2, B_2)$ are disjoint ($A_1$ and $A_2$ have no tuples in common) and that $|\mathcal{N}(A, B)| = |\mathcal{N}(A_1, B_1)| + |\mathcal{N}(A_2, B_2)|$.

Given the inductive hypothesis, it follows from the above that

$L_{\Omega_0}(f) = L_{\Omega_0}(g) + L_{\Omega_0}(h) \ge \frac{|\mathcal{N}(A_1, B_1)|^2}{|A_1||B_1|} + \frac{|\mathcal{N}(A_2, B_2)|^2}{|A_2||B_2|} = \frac{1}{|B|}\left(\frac{|\mathcal{N}(A_1, B_1)|^2}{|A_1|} + \frac{|\mathcal{N}(A_2, B_2)|^2}{|A_2|}\right)$

By the inequality $n_1^2/a_1 + n_2^2/a_2 \ge (n_1 + n_2)^2/(a_1 + a_2)$, which holds for positive integers (see Problem 9.3), the desired result follows because $|A| = |A_1| + |A_2|$. ■

Figure 9.7 The relationships among the sets $f^{-1}(1)$, $g^{-1}(1)$, $h^{-1}(1)$, $A_1$, and $h^{-1}(0)$.

Krapchenko's method is easily applied to the parity function $f_\oplus^{(n)}$. We need only let $A$ ($B$) contain $n$-tuples having an even (odd) number of 1's. ($|A| = |B| = 2^{n-1}$.) Then $|\mathcal{N}(A, B)| = n2^{n-1}$ because for any vector in $A$ there are exactly $n$ vectors in $B$ that are neighbors of it. It follows that $L_{\Omega_0}(f_\oplus^{(n)}) \ge n^2$.
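The neighborhood count and the resulting bound for parity can be checked numerically for small $n$. This is a minimal sketch; the function names `neighborhood`, `krapchenko_bound`, and `parity_bound` are hypothetical, and $n$-tuples are represented as Python tuples of 0/1 values.

```python
from fractions import Fraction
from itertools import product


def neighborhood(A, B):
    """Pairs (x, y), x in A and y in B, that differ in exactly one position."""
    return {(x, y) for x in A for y in B
            if sum(u != v for u, v in zip(x, y)) == 1}


def krapchenko_bound(A, B):
    """|N(A, B)|^2 / (|A| |B|), kept as an exact fraction."""
    return Fraction(len(neighborhood(A, B)) ** 2, len(A) * len(B))


def parity_bound(n):
    """Bound obtained with A = even-weight tuples and B = odd-weight tuples."""
    tuples = list(product((0, 1), repeat=n))
    A = [x for x in tuples if sum(x) % 2 == 0]
    B = [x for x in tuples if sum(x) % 2 == 1]
    return krapchenko_bound(A, B)
```

Because every tuple of even weight has exactly $n$ odd-weight neighbors, the computed bound equals $n^2$ for each $n$ tried.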

9.5  The  Power  of  Negation 

As a prelude to the discussion of monotone circuits for monotone functions in the next section, we consider the minimum number of negations necessary to realize an arbitrary Boolean function $f : \mathcal{B}^n \mapsto \mathcal{B}^m$. From Problem 2.12 on dual-rail logic we know that every such function can be realized by a monotone circuit in which both the variables $x_1, x_2, \ldots, x_n$ and their negations $\overline{x}_1, \overline{x}_2, \ldots, \overline{x}_n$ are provided as inputs. Furthermore, every such circuit need have at most twice as many AND and OR gates as a minimal circuit over $\Omega_0$, the standard basis. Also, the depth of the dual-rail logic circuit of a function is at most one more than the depth of a minimal-depth circuit, the extra depth being that to form $\overline{x}_1, \overline{x}_2, \ldots, \overline{x}_n$.

Let $f_{\mathrm{neg}}^{(n)} : \mathcal{B}^n \mapsto \mathcal{B}^n$ be defined by $f_{\mathrm{neg}}^{(n)}(x_1, x_2, \ldots, x_n) = (\overline{x}_1, \overline{x}_2, \ldots, \overline{x}_n)$. As shown in Lemma 9.5.1, this function can be realized by a circuit of size $O(n^2 \log n)$ and depth $O(\log_2 n)$ over $\Omega_0$ using $\lceil \log_2(n + 1) \rceil$ negations. This implies that most Boolean functions on $n$ variables can be realized by a circuit whose size and depth are within a factor of about 2 of their minimal values when the number of negations is $\lceil \log_2(n + 1) \rceil$.

THEOREM 9.5.1 Every Boolean function on $n$ variables, $f : \mathcal{B}^n \mapsto \mathcal{B}^m$, can be realized by a circuit containing at most $\lceil \log_2(n + 1) \rceil$ negations. Furthermore, the minimal size and depth of such circuits is at most $2C_{\Omega_0}(f) + O(n^2 \log n)$ and $D_{\Omega_0}(f) + O(\log_2 n)$, respectively, where $C_{\Omega_0}(f)$ and $D_{\Omega_0}(f)$ are the circuit size and depth of $f$ over the standard basis $\Omega_0$.

Proof The proof follows directly from the dual-rail expansion of Problem 2.12 and the following lemma. ■

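We now show that the function $f_{\mathrm{neg}}^{(n)} : \mathcal{B}^n \mapsto \mathcal{B}^n$ defined by $f_{\mathrm{neg}}^{(n)}(x_1, x_2, \ldots, x_n) = (\overline{x}_1, \overline{x}_2, \ldots, \overline{x}_n)$ can be realized by a circuit of size $O(n^2 \log n)$ over $\Omega_0$ using $\lceil \log_2(n + 1) \rceil$ negations.

LEMMA 9.5.1 $f_{\mathrm{neg}}^{(n)} : \mathcal{B}^n \mapsto \mathcal{B}^n$ can be realized with $\lceil \log_2(n + 1) \rceil$ negations by a circuit over the standard basis that has size $O(n^2 \log n)$ and depth $O(\log n)$.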

Proof The punctured threshold function $\tau_{t,\neg i}^{(n)} : \mathcal{B}^n \mapsto \mathcal{B}$, $1 \le t, i \le n$, is defined below.

$\tau_{t,\neg i}^{(n)}(x) = \begin{cases} 1 & \text{if } \sum_{j \ne i} x_j \ge t \\ 0 & \text{otherwise} \end{cases}$

This function has value 1 if $t$ or more of the variables other than $x_i$ have value 1. The standard threshold function $\tau_t^{(n)} : \mathcal{B}^n \mapsto \mathcal{B}$ has value 1 when $t$ or more of the variables have value 1. Since the function $(\tau_{1,\neg i}^{(n)}, \ldots, \tau_{n,\neg i}^{(n)})$ is the result of sorting all but the $i$th input, we know from Theorem 6.8.3 that Batcher's bitonic sorting algorithm will produce this output with a circuit of size $O(n \log^2 n)$ and depth $O(\log^2 n)$ because max and min of a comparator unit compute OR and AND, respectively, on binary inputs. Ajtai, Komlos, and Szemeredi [14] have improved this bound to $O(n \log n)$ but with a very large coefficient, and simultaneously achieve depth $O(\log n)$. Thus, all the functions $\{\tau_{t,\neg i}^{(n)} \mid 1 \le t, i \le n\}$ can be realized with $O(n^2 \log n)$ gates and depth $O(\log n)$ over $\Omega_0$.

Observe that for input $x$ there is some largest $t$, $t = t_0$, such that $\tau_t^{(n)}(x) = 1$. If $\tau_{t_0,\neg i}^{(n)}(x) = 1$, then $x_i = 0$; otherwise, $x_i = 1$. Let the implication function $a \Rightarrow b$ have value 1 when $a = 0$ or when $a = 1$ and $b = 1$ and value 0 otherwise. Then we can express the implication function by the formula $(a \Rightarrow b) = \overline{a} \vee b$. It follows that $\overline{x}_i = (\tau_{t_0}^{(n)}(x) \Rightarrow \tau_{t_0,\neg i}^{(n)}(x))$ because the implication function has value 1 exactly when $x_i = 0$.

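We use an indirect method to compute $t_0$. Since $\tau_t^{(n)}(x) = 0$ for $t > t_0$, $(\tau_t^{(n)}(x) \Rightarrow \tau_{t,\neg i}^{(n)}(x)) = 1$ for $t > t_0$. Also, both $\tau_t^{(n)}(x)$ and $\tau_{t,\neg i}^{(n)}(x)$ have value 1 for $t < t_0$. Using $(x \Rightarrow y) = \overline{x} \vee y$, we can write $\overline{x}_i$ as follows:

$\overline{x}_i = \left(\overline{\tau_1^{(n)}(x)} \vee \tau_{1,\neg i}^{(n)}(x)\right) \wedge \left(\overline{\tau_2^{(n)}(x)} \vee \tau_{2,\neg i}^{(n)}(x)\right) \wedge \cdots \wedge \left(\overline{\tau_n^{(n)}(x)} \vee \tau_{n,\neg i}^{(n)}(x)\right)$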
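The identity expressing $\overline{x}_i$ through threshold functions can be verified exhaustively for small $n$. The sketch below evaluates the formula directly rather than building a circuit; the names `tau`, `tau_punctured`, and `negate_via_thresholds` are hypothetical, and the index $i$ is 0-based.

```python
from itertools import product


def tau(t, x):
    """Threshold function: 1 when t or more of the variables are 1."""
    return 1 if sum(x) >= t else 0


def tau_punctured(t, i, x):
    """Punctured threshold: 1 when t or more variables other than x_i are 1."""
    return 1 if sum(x) - x[i] >= t else 0


def negate_via_thresholds(i, x):
    """AND over t = 1..n of (NOT tau_t(x)) OR tau_{t, not i}(x).

    The complements of tau_t are taken arithmetically here; in the circuit
    they are the only signals that require negations.
    """
    n = len(x)
    result = 1
    for t in range(1, n + 1):
        result &= (1 - tau(t, x)) | tau_punctured(t, i, x)
    return result
```

Apart from the complemented thresholds, the expression uses only AND and OR, which is why the remaining work in the proof is to produce the complements with few negations.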


The circuit design is complete once a circuit for $\{\overline{\tau_t^{(n)}(x)} \mid 1 \le t \le n\}$ has been designed. We begin by using a binary sorting circuit that computes $\{\tau_t^{(n)}(x) \mid 1 \le t \le n\}$ from $x$, which, as stated above, can be computed with $O(n \log^2 n)$ gates over the standard basis. Let $s_t = \tau_t^{(n)}(x)$ for $1 \le t \le n$.

For $n = K - 1$, $K = 2^k$ and $k$ an integer, we complete the design by constructing a circuit for the function $\nu^{(k)} : \mathcal{B}^n \mapsto \mathcal{B}^n$, which, given as input the decreasing sequence $s_1, s_2, \ldots, s_n$ ($s_i \ge s_{i+1}$), computes as its $j$th output $z_j = \overline{s}_j$, $1 \le j \le n$. (The case $n \ne 2^k - 1$ is considered below.) That is, $\nu^{(k)}(s) = z$, where $z_t = \overline{\tau_t^{(n)}(x)}$. We give a recursive construction of a circuit for $\nu^{(k)}$ whose correctness is established by induction.

Figure 9.8 A circuit for $\nu^{(k)} : \mathcal{B}^n \mapsto \mathcal{B}^n$, $n = K - 1$, $K = 2^k$. It is given the sorted $n$-tuple $s$ as input, where $s_j \ge s_{j+1}$ for $1 \le j < n$, and produces as output $z$, where $z_j = \overline{s}_j$.


The base case is a circuit for $\nu^{(1)}$. This circuit has one input, $s_1$, and one output, $z_1 = \overline{s}_1$, and can be realized by one negation and no other gates.

We construct a circuit for $\nu^{(k)}$ from one for $\nu^{(k-1)}$ using $2n$ additional gates and increasing the depth by three, as shown in Fig. 9.8. Let the inputs and outputs to the circuit for $\nu^{(k-1)}$ be $s_i^*$ and $z_i^*$, $1 \le i \le K^* - 1$, where $K^* = K/2$. It follows that $s_i^* \ge s_{i+1}^*$ for $1 \le i < (K/2) - 1$. By induction $z_i^* = \overline{s_i^*}$ for $1 \le i \le K^* - 1$.

To show that the $j$th output of the circuit for $\nu^{(k)}$ is $z_j = \overline{s}_j$, we consider cases. If $s_{K/2} = 0$, then $s_j = 0$ for $j \ge K/2$. In this case the $j$th circuit output, $(K/2) \le j \le K - 1$, satisfies $z_j = 1$ (the corresponding output gate is OR), which is the correct value. Also, for $1 \le j \le (K/2) - 1$, $z_j = z_j^* = \overline{s}_j$ since the inputs to the circuit for $\nu^{(k-1)}$ are $s_1, s_2, \ldots, s_{(K/2)-1}$ ($s_j = 0$ for $j \ge K/2$) and its outputs are $\overline{s}_1, \overline{s}_2, \ldots, \overline{s}_{(K/2)-1}$. On the other hand, if $s_{K/2} = 1$, then $s_j = 1$ and $z_j = 0$ for $j \le (K/2) - 1$ (the corresponding output gate is AND). Also, for $(K/2) + 1 \le j \le K - 1$, $z_j = z_{j-K/2}^* = \overline{s}_j$ since the inputs to the circuit for $\nu^{(k-1)}$ are $s_{(K/2)+1}, \ldots, s_{K-1}$ and its outputs are $\overline{s}_{(K/2)+1}, \ldots, \overline{s}_{K-1}$.

It follows that $k = \log_2(n + 1)$ negations are used. The circuit for $\nu^{(k)}$ uses a total of
$C(k) = C(k - 1) + 2^{k+1} - 3$ gates, where $C(1) = 1$. The solution to this recurrence
is $C(k) = 4(2^k) - 3k - 4 = 4n - 3\log_2(n + 1)$, since $n = 2^k - 1$. Also, the circuit for $\nu^{(k)}$ has depth
$D(k) = D(k-1) + 4$, where $D(1) = 0$. The solution to this recurrence is $D(k) = 4(k - 1)$.
If $n$ is not of the form $2^k - 1$, we increase $n$ to the next largest integer of this form, which
implies that $k = \lceil \log_2(n + 1) \rceil$. Using the upper bounds on the size of circuits to compute
$\tau^{(n)}_t$ for $1 \leq t \leq n$, we have the desired conclusion. ■
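The two recurrences in this proof can be checked mechanically. The following is a minimal sketch, not part of the original text; the function names `gates` and `depth` are chosen here for illustration. It unrolls each recurrence and compares it with the stated closed form; with $n = 2^k - 1$ the gate count can also be written as $4n - 3k$.

```python
def gates(k: int) -> int:
    """Gate count from C(k) = C(k-1) + 2^(k+1) - 3 with C(1) = 1."""
    return 1 if k == 1 else gates(k - 1) + 2 ** (k + 1) - 3


def depth(k: int) -> int:
    """Depth from D(k) = D(k-1) + 4 with D(1) = 0."""
    return 0 if k == 1 else depth(k - 1) + 4


def closed_forms_hold(max_k: int = 12) -> bool:
    """Compare the unrolled recurrences with their closed forms."""
    for k in range(1, max_k + 1):
        n = 2 ** k - 1
        if gates(k) != 4 * 2 ** k - 3 * k - 4:
            return False
        if gates(k) != 4 * n - 3 * k:  # same value expressed through n
            return False
        if depth(k) != 4 * (k - 1):
            return False
    return True
```

Calling `closed_forms_hold()` is a quick sanity check on the algebra; it says nothing about the correctness of the circuit construction itself.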


9.6  Lower-Bound  Methods  for  Monotone  Circuits 

The best lower bounds that have been derived on the circuit size over complete bases of Boolean
functions on $n$ variables are linear in $n$. Similarly, the best lower bounds on formula size that
have been derived over complete bases are at best quadratic in $n$. As a consequence, the search
for better lower bounds has led to the study of monotone circuits (their basis is $\Omega_{\mathrm{mon}}$) for
monotone functions. In one sense, this effort has been surprisingly successful. Techniques
have been developed to show that some monotone functions have exponential circuit size.
Since most monotone Boolean functions on $n$ variables have circuit size $\Theta(2^n/n^{3/2})$, this is
a strong result. On the other hand, the hope that such techniques would lead to strong lower
bounds on circuit size for monotone functions over complete bases has not yet been realized.

Some monotone functions are very important. Among these are the clique function
$f^{(n)}_{\mathrm{clique},k} : \mathcal{B}^{n(n-1)/2} \mapsto \mathcal{B}$. $f^{(n)}_{\mathrm{clique},k}$ is associated with a family of undirected graphs
$G = (V, E)$ on $n = |V|$ vertices and $|E| \leq n(n - 1)/2$ edges, where $V = \{1, 2, 3, \ldots, n\}$.
The variables of $f^{(n)}_{\mathrm{clique},k}$ are denoted $\{x_{i,j} \mid 1 \leq i < j \leq n\}$, where $x_{i,j} = 1$ if there is an
edge between vertices $i$ and $j$ and $x_{i,j} = 0$ otherwise. The value of $f^{(n)}_{\mathrm{clique},k}$ on these variables
is 1 if $G$ contains a $k$-clique, a set of $k$ vertices such that there is an edge between every pair of
vertices in the set. The value of $f^{(n)}_{\mathrm{clique},k}$ is 0 otherwise. Clearly $f^{(n)}_{\mathrm{clique},k}$ is monotone because
increasing the value of a variable from 0 to 1 cannot decrease the value of the function.
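As an illustration of the definition, the clique function can be evaluated by brute force over all $k$-subsets of vertices. This is a sketch for small graphs only; the name `f_clique` and the representation of the edge variables as a list of vertex pairs are assumptions made here, not notation from the text.

```python
from itertools import combinations


def f_clique(n, k, edges):
    """Return 1 if the graph on vertices 1..n with the given edges has a k-clique.

    `edges` lists the pairs (i, j) whose variable x_{i,j} is 1; every other
    variable is 0.  The search is exponential in k and is meant only to
    illustrate the definition.
    """
    edge_set = {frozenset(e) for e in edges}
    return int(any(
        all(frozenset(pair) in edge_set for pair in combinations(vertices, 2))
        for vertices in combinations(range(1, n + 1), k)
    ))


def is_monotone(n, k):
    """Check exhaustively that adding one edge never lowers the function value."""
    all_edges = list(combinations(range(1, n + 1), 2))
    for mask in range(2 ** len(all_edges)):
        present = [e for i, e in enumerate(all_edges) if (mask >> i) & 1]
        base = f_clique(n, k, present)
        if any(f_clique(n, k, present + [e]) < base for e in all_edges):
            return False
    return True
```

For example, `f_clique(4, 3, [(1, 2), (2, 3), (1, 3)])` evaluates the function on a graph whose only edges form a triangle, and `is_monotone(4, 3)` tests monotonicity over all 64 edge assignments on four vertices.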

As stated in Problem 8.24, the CLIQUE problem is NP-complete. Since an instance of
CLIQUE on a graph with $n$ vertices can be converted to the input format for $f^{(n)}_{\mathrm{clique},k}$ in time
polynomial in $n$, if the circuit size for $f^{(n)}_{\mathrm{clique},k}$ over a complete basis can be shown to be
superpolynomial, then from Corollary 3.9.1, $P \neq NP$.

There are important similarities and differences between monotone and non-monotone
functions. Every non-monotone function can be realized by a circuit over the standard basis
$\Omega_0$ in which negations are used only on inputs. (See Problem 9.11.) On the other hand, since
circuits without negation compute only monotone functions (Problem 2), negations on inputs
are essential.

The first results showing the existence of monotone functions such that their monotone
and non-monotone circuit sizes are different were obtained for multiple-output functions. We
illustrate this approach below for the $n$-input binary sorting function, $f^{(n)}_{\mathrm{sort}}$, whose monotone
circuit size is shown to be $\Theta(n \log n)$. As stated in Problem 2.17, this function can be realized
by a circuit whose size over $\Omega_0$ is linear in $n$.

We introduce the path method to show that a gap exists between the monotone and
non-monotone circuit size of a family of functions. In Section 9.6.3 the approximation method
is introduced and used to show that the clique function $f^{(n)}_{\mathrm{clique},k}$ has exponential monotone
circuit size.

9.6.1 The Path-Elimination Method

In this section we illustrate the path-elimination method for deriving lower bounds on circuit
size for monotone functions. This method demonstrates that a path of gates in a monotone
circuit can be eliminated by fixing one input variable. Thus, it is the monotone equivalent
of the gate-elimination method for general circuits. We apply the method to two problems,
binary sorting and binary merging.

Consider computing the binary sorting function $f^{(n)}_{\mathrm{sort}} : \mathcal{B}^n \mapsto \mathcal{B}^n$ introduced in
Section 2.11. This function rearranges the bits in a binary $n$-input string into descending order.
Thus, the first sorted output is 1 if one or more of the inputs is 1, the second is 1 if two or more
of them are 1, etc. Consequently, we can write $f^{(n)}_{\mathrm{sort}}(x_1, x_2, \ldots, x_n) = (\tau^{(n)}_1, \tau^{(n)}_2, \ldots, \tau^{(n)}_n)$,
where $\tau^{(n)}_t$ is the threshold function on $n$ inputs with threshold $t$ whose value is 1 if $t$ or more
of its inputs are 1 and 0 otherwise. Ajtai, Komlós, and Szemerédi [14] have shown the
existence of a comparator-based sorting network on $n$ inputs of size $O(n \log n)$. (The coefficient
on this bound is so large that the bound has only asymptotic value.) Such networks can be
converted to a monotone network by replacing the max and min operators in comparators
with OR and AND, respectively.
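The conversion from comparators to monotone gates can be sketched on a much simpler network than the one of Ajtai, Komlós, and Szemerédi. The example below uses an odd-even transposition network, which has $\Theta(n^2)$ comparators and is chosen here only because it is short to write down; it is an illustrative substitute, not the $O(n \log n)$ construction cited above. The function names are hypothetical.

```python
from itertools import product


def f_sort(x):
    """Binary sorting via threshold functions: output t is 1 iff at least t inputs are 1."""
    ones = sum(x)
    return tuple(int(ones >= t) for t in range(1, len(x) + 1))


def monotone_sorting_network(x):
    """Odd-even transposition network with each comparator realized as (OR, AND).

    On bits, max(u, v) = u OR v and min(u, v) = u AND v, so the network uses
    only monotone gates.  The larger value is routed to the lower index, which
    yields descending order after n rounds.
    """
    v = list(x)
    n = len(v)
    for rnd in range(n):
        for i in range(rnd % 2, n - 1, 2):
            v[i], v[i + 1] = v[i] | v[i + 1], v[i] & v[i + 1]
    return tuple(v)


def networks_agree(max_n=6):
    """Exhaustively compare the network with the threshold definition."""
    return all(
        monotone_sorting_network(x) == f_sort(x)
        for n in range(1, max_n + 1)
        for x in product((0, 1), repeat=n)
    )
```

Because every gate is an OR or an AND, the same code also shows why each output is a monotone function of the inputs.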

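THEOREM 9.6.1 The monotone circuit size for $f^{(n)}_{\mathrm{sort}}$ satisfies the following bounds:

$$n\lceil \log_2 n \rceil - 2^{\lceil \log_2 n \rceil} \leq C_{\Omega_{\mathrm{mon}}}\bigl(f^{(n)}_{\mathrm{sort}}\bigr) = O(n \log n)$$

Proof To derive the lower bound, we show that in any circuit for $f^{(n)}_{\mathrm{sort}}$ there is an input
variable that can be set to 1, thereby allowing at least $\lceil \log_2 n \rceil$ gates along a path from it to
the output $\tau^{(n)}_1$ to be removed from the circuit and converting the circuit to one for $f^{(n-1)}_{\mathrm{sort}}$.
As a result, we show the following relationship:

$$C_{\Omega_{\mathrm{mon}}}\bigl(f^{(n)}_{\mathrm{sort}}\bigr) \geq C_{\Omega_{\mathrm{mon}}}\bigl(f^{(n-1)}_{\mathrm{sort}}\bigr) + \lceil \log_2 n \rceil$$

A simple proof by induction and a little algebra show that the desired result follows from
this bound and the fact that $C_{\Omega_{\mathrm{mon}}}\bigl(f^{(2)}_{\mathrm{sort}}\bigr) = 2$, which is easy to establish.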
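The "little algebra" can be checked numerically. The sketch below assumes the recurrence $C(n) \geq C(n-1) + \lceil \log_2 n \rceil$ with base value $C(2) = 2$, as read from the proof, and compares the unrolled sum with the closed form $n\lceil \log_2 n \rceil - 2^{\lceil \log_2 n \rceil}$. The helper names are illustrative.

```python
def ceil_log2(j: int) -> int:
    """Ceiling of log2(j) for an integer j >= 1, computed without floating point."""
    return (j - 1).bit_length()


def unrolled_bound(n: int) -> int:
    """Lower bound obtained by unrolling the recurrence down to the base case n = 2."""
    return 2 + sum(ceil_log2(j) for j in range(3, n + 1))


def closed_form(n: int) -> int:
    """The closed form appearing in the theorem statement."""
    return n * ceil_log2(n) - 2 ** ceil_log2(n)


def gap_is_constant(max_n: int = 2000) -> bool:
    """The unrolled sum exceeds the closed form by exactly 2 for every n >= 2."""
    return all(unrolled_bound(n) == closed_form(n) + 2 for n in range(2, max_n + 1))
```

Under these assumptions the unrolled recurrence is slightly stronger than the stated bound, which is consistent with the inequality in the theorem.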

Let $x_j = 0$ for $j \neq i$ but let $x_i$ vary. The only functions computed at gates are 0, 1, or
$x_i$. Also, the value of $\tau^{(n)}_1(x)$ on such inputs is equal to $x_i$. Consequently, there must be a
path $P$ from the vertex labeled $x_i$ to $\tau^{(n)}_1$ such that at each gate on the path the function $x_i$ is
computed. (See Fig. 9.9.) Thus, if we set $x_i = 1$ when $x_j = 0$ for $j \neq i$, the output of each
of these gates is 1. Furthermore, since the circuit is monotone, each function computed at a
gate is monotone (see Problem 2). Thus, if any other input is subsequently increased from
0 to 1, the value of $\tau^{(n)}_1$ and of all the gates on the path $P$ from $x_i$ remain at 1 and can be
removed. This setting of $x_i$ also has the effect of reducing the threshold of all other output
functions by 1 and implies that the circuit now computes the binary sorting function on one
fewer variable.

Consider a minimal monotone circuit for $f^{(n)}_{\mathrm{sort}}$. The shortest paths from each input to
the output $\tau^{(n)}_1$ form a tree of fan-in 2. From Theorem 9.3.1 there is a path in this tree from
some input, say $x_r$, to $\tau^{(n)}_1$ that has length at least $\lceil \log_2 n \rceil$. Consequently the shortest path
from $x_r$ to $\tau^{(n)}_1$ has length at least $\lceil \log_2 n \rceil$, implying that at least $\lceil \log_2 n \rceil$ gates can be
removed if $x_r$ is set to 1. ■

Figure 9.9 When $x_i = 1$ there is a path $P$ to $\tau_1$ such that each gate on $P$ has value 1.


We now derive a stronger result: we show that every monotone circuit for binary merging
has a size that is $\Omega(n \log n)$. Binary merging is realized by a function $f^{(n)}_{\mathrm{merge}} : \mathcal{B}^n \mapsto \mathcal{B}^n$, $n =
2k$, defined as follows: given two sorted binary $k$-tuples $x$ and $y$, the value of $f^{(n)}_{\mathrm{merge}}(x, y)$
is the $n$-tuple that results from sorting the $n$-tuple formed by concatenating $x$ and $y$. Thus,
a binary merging circuit can be obtained from one for sorting simply by restricting the values
assumed by inputs to the sorting circuit. (Binary merging is a subfunction of binary sorting.)

It follows that a lower bound on $C_{\Omega_{\mathrm{mon}}}\bigl(f^{(n)}_{\mathrm{merge}}\bigr)$ is a lower bound on $C_{\Omega_{\mathrm{mon}}}\bigl(f^{(n)}_{\mathrm{sort}}\bigr)$.

THEOREM 9.6.2 Let $n$ be even. Then the monotone circuit size for $f^{(n)}_{\mathrm{merge}} : \mathcal{B}^n \mapsto \mathcal{B}^n$ satisfies
the following bounds:

$$(n/2) \log_2 n - O(n) \leq C_{\Omega_{\mathrm{mon}}}\bigl(f^{(n)}_{\mathrm{merge}}\bigr) = O(n \log n)$$

Proof The upper bound on $C_{\Omega_{\mathrm{mon}}}\bigl(f^{(n)}_{\mathrm{merge}}\bigr)$ follows from the construction given in
Theorem 6.8.2 after max and min comparison operators are replaced by ORs and ANDs,
respectively.

Let $k = n/2$. The function $f^{(n)}_{\mathrm{merge}}$ operates on two $k$-tuples $x$ and $y$ to produce the
merged result $f^{(n)}_{\mathrm{merge}}(x, y)$, where $x$ and $y$ are in descending order; that is, $x_1 \geq x_2 \geq
\cdots \geq x_k$ and $y_1 \geq y_2 \geq \cdots \geq y_k$. As stated above for binary sorting, the output functions
are $\tau_1, \tau_2, \ldots, \tau_n$.

Let $x_1 = x_2 = \cdots = x_{r-1} = 1$, $x_{r+1} = \cdots = x_k = 0$, $y_1 = y_2 = \cdots = y_s = 1$,
and $y_{s+1} = \cdots = y_k = 0$. Let $x_r$ be unspecified. Since the circuit is monotone, the value
computed by each gate of the circuit is 0, 1, or $x_r$. Also,

$$\tau_t = \begin{cases} 1 & t < r + s \\ x_r & t = r + s \\ 0 & t > r + s \end{cases}$$

It follows that there must be a path $P^{(r+s)}_r$ of gates from the input labeled $x_r$ to the
output labeled $\tau_{r+s}$ such that each gate output is $x_r$. If $x_r = 0$, since the components of $x$
are sorted, $x_{r+1} = \cdots = x_k = 0$. On the other hand, if $x_r = 1$, by monotonicity the value
of $\tau_{r+s}$ cannot change under variation of the values $x_{r+1}, \ldots, x_k$. Thus, $\tau_j$ is essentially
dependent on $x_i$ for $i$ and $j$ satisfying $1 \leq i \leq k$ and $i \leq j \leq i + k$. (See Fig. 9.10.) Let
$e(j)$ denote the number of variables in $x$ on which $\tau_j$ depends; then $e(j) = j$ for $j \leq k$
and $e(j) = 2k - j + 1$ for $j > k$.

Figure 9.10 Let $f^{(n)}_{\mathrm{merge}}(x, y) = (\tau_1, \ldots, \tau_n)$, where $x$ and $y$ are $(n/2)$-tuples. The dots in
the $j$th row show the inputs on which $\tau_j$ depends. $e(j)$ is the number of dots in the $j$th row.
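The dependence pattern and the values of $e(j)$ can be reproduced by brute force. The sketch below is an illustration with hypothetical names: it enumerates sorted $k$-tuples, flips a single component $x_r$ in a way that keeps $x$ sorted, and records which outputs of the merge change.

```python
def sorted_tuple(ones: int, k: int):
    """The descending sorted k-tuple with the given number of leading 1s."""
    return tuple([1] * ones + [0] * (k - ones))


def merge(x, y):
    """Binary merging: output t (1-based) is 1 iff x and y together contain at least t ones."""
    ones = sum(x) + sum(y)
    return tuple(int(ones >= t) for t in range(1, len(x) + len(y) + 1))


def dependence(k: int):
    """Map each output index j to the set of x-indices i on which tau_j depends."""
    dep = {j: set() for j in range(1, 2 * k + 1)}
    for s in range(k + 1):                # y has s leading 1s
        y = sorted_tuple(s, k)
        for r in range(1, k + 1):         # flip x_r with x_1..x_{r-1} = 1 and the rest 0
            low = merge(sorted_tuple(r - 1, k), y)
            high = merge(sorted_tuple(r, k), y)
            for j in range(1, 2 * k + 1):
                if low[j - 1] != high[j - 1]:
                    dep[j].add(r)
    return dep


def e_values(k: int):
    """e(j) for j = 1..2k, as a dictionary."""
    return {j: len(inputs) for j, inputs in dependence(k).items()}
```

For $k = 3$ this gives $e(1), \ldots, e(6) = 1, 2, 3, 3, 2, 1$, in agreement with the formula $e(j) = j$ for $j \leq k$ and $e(j) = 2k - j + 1$ for $j > k$.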

We show by induction that there exist vertex-disjoint paths between $x_1$ and $\tau_{s+1}$, $x_2$
and $\tau_{s+2}$, \ldots, $x_k$ and $\tau_{s+k}$ for $0 \leq s \leq k$. (See Fig. 9.11.) Thus, there are $k + 1$ sets of
vertex-disjoint paths connecting the $k = n/2$ inputs in $x$ and $k$ consecutive outputs.


Figure 9.11 (a) In a monotone circuit for $f^{(n)}_{\mathrm{merge}}$, $n = 2k$, $k + 1$ sets of $k$ disjoint paths exist
between the $k$ inputs $x$ and $k$ consecutive outputs. (b) The paths to an output $\tau_j$ form a binary tree.



To show the existence of the vertex-disjoint paths, let $y_1 = y_2 = \cdots = y_s = 1$,
$y_{s+1} = \cdots = y_k = 0$ and $x_1 = x_2 = \cdots = x_{r-1} = 1$, but let $x_r, x_{r+1}, \ldots, x_k$ be
unspecified. Then $\tau_{r+s} = x_r$ and, as stated above, there is a path $P^{(r+s)}_r$ of gates from an
input labeled $x_r$ to the output labeled $\tau_{r+s}$ such that each gate has value $x_r$. Set $x_r = 1$.
Reasoning as before, there must be a path $P^{(r+1+s)}_{r+1}$ of gates from an input labeled $x_{r+1}$ to
the output labeled $\tau_{r+1+s}$ such that each gate has value $x_{r+1}$. Thus, $P^{(r+1+s)}_{r+1}$ and $P^{(r+s)}_r$
are vertex-disjoint. Extending this idea, we have the desired conclusion about disjoint paths.

We now develop a second fact about these paths that is needed in the lower bound. Let
$P^{(r+s)}_r$ be a path from $x_r$ to $\tau_{r+s}$, as suggested in Fig. 9.11(a). Those paths connecting
inputs to any one output form a binary tree, as suggested in Fig. 9.11(b). The number of
inputs from which there is a path to $\tau_j$ is $e(j)$, the number of inputs on which $\tau_j$ depends.

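To derive the lower bound on $C_{\Omega_{\mathrm{mon}}}\bigl(f^{(n)}_{\mathrm{merge}}\bigr)$, let $d(i, j)$ denote the length (number
of edges or non-input vertices (gates)) of the shortest path from an input labeled $x_i$ to the
output labeled $\tau_j$. (Clearly, $d(i, j) = 0$ unless $i \leq j \leq i + k$.) Since the path from
input $x_i$ to output $\tau_j$ described above has a length at least as large as $d(i, j)$, it follows that
$C_{\Omega_{\mathrm{mon}}}\bigl(f^{(n)}_{\mathrm{merge}}\bigr)$ satisfies the following bound:

$$C_{\Omega_{\mathrm{mon}}}\bigl(f^{(n)}_{\mathrm{merge}}\bigr) \geq \max\left\{ \sum_{r=1}^{k} d(r, r+s) \;\middle|\; 0 \leq s \leq k \right\}$$

Since the maximum of a set of integers is at least equal to the average of these integers, we
have the following for $k = n/2 \geq 1$:

$$C_{\Omega_{\mathrm{mon}}}\bigl(f^{(n)}_{\mathrm{merge}}\bigr) \geq \frac{1}{k+1} \sum_{s=0}^{k} \sum_{r=1}^{k} d(r, r+s) = \frac{1}{k+1} \sum_{j=1}^{2k} \sum_{i=1}^{k} d(i, j)$$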

The last identity follows by using the fact that $d(i, j) = 0$ unless $i \leq j \leq i + k$. But
$\sum_{i=1}^{k} d(i, j)$ is the sum of the distances of the shortest paths from the relevant inputs of $x$ to
output $\tau_j$, $1 \leq j \leq 2k$. Since these paths form a binary tree and $\tau_j$ depends on $e(j)$ inputs,
this is the external path length of a tree with $e(j)$ leaves. The external path length is at least
$e(j)\lceil \log_2 e(j) \rceil - 2^{\lceil \log_2 e(j) \rceil} + e(j)$ (see Problem 9.4). In turn, $x\lceil \log_2 x \rceil - 2^{\lceil \log_2 x \rceil} + x \geq
x \log_2 x$, because $\lceil \log_2 x \rceil = (\log_2 x) + \delta$ for $0 \leq \delta < 1$ and $x\lceil \log_2 x \rceil - 2^{\lceil \log_2 x \rceil} + x =
x \log_2 x + x(1 - 2^{\delta} + \delta)$, where $1 - 2^{\delta} + \delta$ is easily shown to be a concave function whose
minimum value occurs at either $\delta = 0$ or $\delta = 1$, both of which are 0. Thus, $1 - 2^{\delta} + \delta \geq 0$
and the result follows. Thus, the size of the smallest monotone circuit satisfies the following
lower bound when $n = 2k$:

$$C_{\Omega_{\mathrm{mon}}}\bigl(f^{(n)}_{\mathrm{merge}}\bigr) \geq \frac{1}{k+1} \sum_{j=1}^{2k} e(j) \log_2 e(j) = \frac{2}{k+1} \sum_{j=1}^{k} j \log_2 j$$

The last equality uses the definition of $e(j)$ given above. By applying the reasoning in
Problem 2.1 and captured in Fig. 2.23, it is easy to show that the above sum is at least as
large as $(2/(k + 1))(\log_2 e) \int_1^k y \log_e y \, dy$, whose value is $(2/(k + 1))[(k^2/2) \log_2 k -
(1/4)k^2(\log_2 e) + (1/4)(\log_2 e)]$. From this the desired conclusion follows, since $k = n/2$. ■
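The final step of the proof can be explored numerically. The sketch below is an illustration, not part of the original argument: it evaluates $(2/(k+1))\sum_{j=1}^{k} j \log_2 j$, compares it with the exact value of the integral $\int_1^k y \ln y \, dy = (k^2/2)\ln k - k^2/4 + 1/4$ scaled by $\log_2 e$, and measures how far the sum falls below $(n/2)\log_2 n$. The function names are chosen here.

```python
import math


def sum_bound(k: int) -> float:
    """(2/(k+1)) * sum_{j=1..k} j*log2(j), the lower bound derived in the proof."""
    return 2.0 / (k + 1) * sum(j * math.log2(j) for j in range(1, k + 1))


def integral_bound(k: int) -> float:
    """(2/(k+1)) * log2(e) * integral_1^k y*ln(y) dy, using the exact antiderivative."""
    log2e = math.log2(math.e)
    return 2.0 / (k + 1) * ((k * k / 2) * math.log2(k) - (k * k / 4) * log2e + log2e / 4)


def deficit(k: int) -> float:
    """How far the sum bound falls below (n/2)*log2(n) for n = 2k."""
    n = 2 * k
    return (n / 2) * math.log2(n) - sum_bound(k)
```

For the values of $k$ tried in the accompanying checks, the sum dominates the integral and the deficit stays below $2n$, which is consistent with the $(n/2)\log_2 n - O(n)$ form of the theorem.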

We now present lower bounds on the monotone circuit size of Boolean convolution and
Boolean matrix multiplication, problems for which the gap between the monotone and
non-monotone circuit size is much larger than for sorting and merging.

9.6.2  The  Function  Replacement  Method 

The function replacement method simplifies monotone circuits by replacing a function
computed at an internal vertex by a new function without changing the function computed by the
overall circuit. Since a replacement step eliminates gates and reduces a problem to a
subproblem, the method provides a basis for establishing lower bounds on circuit complexity using
proof by induction.

We describe two replacement rules and then apply them to Boolean convolution and
Boolean matrix multiplication. These two problems are defined in the usual way except that
variables assume Boolean values in $\mathcal{B}$ and the multiplication and addition operators are
interpreted as AND and OR, respectively.


REPLACEMENT  RULES  A  replacement  rule  is  a  rule  that  allows  a  function  computed  at  a  vertex 
of  a  circuit  to  be  replaced  by  another  without  changing  the  function  computed  by  the  circuit. 
Before  stating  such  rules  for  monotone  functions,  we  introduce  some  terminology. 


DEFINITION 9.6.1 Let $x$ denote the variables of a Boolean function $f : \mathcal{B}^n \mapsto \mathcal{B}$. An implicant
of $f$ is a product (AND), $\pi$, of a subset of the literals of $f$ (the variables and their complements)
such that if $\pi(x) = 1$ on input $n$-tuple $x$, then $f(x) = 1$. (This is denoted $\pi \leq f$.) The set of
implicants of a function $f$ is denoted $I(f)$.

An implicant $\pi$ of a Boolean function $f$ is a prime implicant if there is no implicant $\pi_1$
different from $\pi$ such that $\pi \leq \pi_1 \leq f$. The set of prime implicants of a function $f$ is denoted
$PI(f)$.

A monotone implicant (also called a monom) of a monotone Boolean function $f : \mathcal{B}^n \mapsto \mathcal{B}$
is the product (AND) $\pi$ of uncomplemented variables of $f$ such that if $\pi(x) = 1$ on input $n$-tuple
$x$, then $f(x) = 1$. The empty monom has value 1. The set of monotone implicants of a
function $f$ is denoted $I_{\mathrm{mon}}(f)$.

A monotone implicant $\pi$ of a Boolean function $f$ is a monotone prime implicant if there is
no monotone implicant $\pi_1$ different from $\pi$ such that $\pi \leq \pi_1 \leq f$. The set of monotone prime
implicants of a function $f$ is denoted $PI_{\mathrm{mon}}(f)$.


The products in the sum-of-products expansion (SOPE) are (non-monotone) implicants
of a Boolean function. If a function is monotone, it has monotone implicants (monoms). The
prime implicants of a Boolean function $f$ define it completely; the OR of its prime implicants
is a formula representing it. In the case of a monotone Boolean function, the prime implicants
are monotone prime implicants. (See Problem 9.33.)

When it is understood from context that an implicant or prime implicant is monotone,
we may omit the word "monotone" and the subscript "mon." This will be the case in this
section.

The function $c_{j+1} = (p_j \wedge c_j) \vee g_j$ used in the design of a full adder (see Section 2.7)
is a monotone function of the variables $p_j$, $c_j$, and $g_j$. Its set of implicants is $I(c_{j+1}) =
\{p_j \wedge c_j,\; g_j,\; p_j \wedge g_j,\; c_j \wedge g_j,\; p_j \wedge c_j \wedge g_j\}$. If any one of these products has value 1 then so
does $c_{j+1}$. Its set of prime implicants is $PI(c_{j+1}) = \{p_j \wedge c_j,\; g_j\} \subset I(c_{j+1})$ because these
are the smallest products for which $c_{j+1}$ has value 1. Thus, $c_{j+1}$ is defined by $PI(c_{j+1})$ and
represented as $c_{j+1} = (p_j \wedge c_j) \vee g_j$.
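The implicant sets of the carry function are small enough to enumerate. The sketch below represents a monom as a frozenset of variable names (an encoding chosen here for illustration) and computes the monotone implicants and the monotone prime implicants by exhaustive evaluation.

```python
from itertools import combinations, product

VARS = ("p", "c", "g")  # stand-ins for p_j, c_j, g_j


def carry(p, c, g):
    """The carry function c_{j+1} = (p_j AND c_j) OR g_j."""
    return (p & c) | g


def is_implicant(monom, f):
    """A monom is an implicant of f if f = 1 whenever every variable of the monom is 1."""
    for values in product((0, 1), repeat=len(VARS)):
        env = dict(zip(VARS, values))
        if all(env[v] for v in monom) and not f(**env):
            return False
    return True


def monotone_implicants(f):
    """All monoms (products of uncomplemented variables) that are implicants of f."""
    return {
        frozenset(m)
        for size in range(len(VARS) + 1)
        for m in combinations(VARS, size)
        if is_implicant(frozenset(m), f)
    }


def prime_implicants(implicants):
    """Implicants that contain no other implicant as a proper sub-monom."""
    return {m for m in implicants if not any(other < m for other in implicants)}
```

A monom with fewer variables is a larger function, so the prime implicants are exactly the implicants that are minimal under set inclusion; `prime_implicants(monotone_implicants(carry))` returns the two monoms `{p, c}` and `{g}`.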

We now present a replacement rule for monotone functions that captures the following
idea: if a function $g$ computed by a gate of a monotone circuit has a monom $\pi$ that is not a
monom of the function $f$ computed by the complete circuit, then $\pi$ can be removed from $g$
without affecting the value of $f$. This idea is valid in monotone circuits because the absence
of negation provides only one way to eliminate extra monoms, namely, by ORing them with
products containing a subset of their variables. Taking the AND of a monom with another
term creates a longer monom. Thus, since monoms that are not monoms of the function $f$
computed by a circuit must be eliminated, there is no loss of generality in assuming that they
are not produced in the first place.

DEFINITION 9.6.2 Let $f : \mathcal{B}^n \mapsto \mathcal{B}$ and $g : \mathcal{B}^n \mapsto \mathcal{B}$ be two monotone functions. Let $g$ be
computed within a monotone circuit for $f$. The following is a replacement rule for $g$:

a) Let $\pi \in PI(g)$ and let $h$ be defined by $PI(h) = PI(g) - \{\pi\}$. Replace the gate computing
$g$ by one computing $h$ if for all monoms $\pi'$ (including the empty monom), $\pi \wedge \pi' \notin PI(f)$.

We now show that any monom $\pi$ satisfying Rule (a) can be removed from $PI(g)$ because
it contributes nothing to the computation of $f$.

LEMMA 9.6.1 Let $f : \mathcal{B}^n \mapsto \mathcal{B}$ and $g : \mathcal{B}^n \mapsto \mathcal{B}$ be two monotone functions and let $\pi \in PI(g)$
be such that for all monoms $\pi'$ (including the empty monom), $\pi \wedge \pi' \notin PI(f)$. Let $h$ be defined
by $PI(h) = PI(g) - \{\pi\}$. If $g$ is computed in some monotone circuit for $f$, the circuit obtained
by replacing $g$ by $h$ also computes $f$.

Proof Let $C$ denote a circuit for $f$ within which the function $g$ is computed. Let $C^*$ be
the circuit obtained by replacing $g$ by $h$ under Rule (a). Since $h \leq g$ and the circuit is
monotone, the function $f^*$ computed by $C^*$ satisfies $f^* \leq f$. We suppose that $f^* \neq f$ and
show that a contradiction results.

If $f^* \neq f$, there is some input $n$-tuple $a \in \mathcal{B}^n$ such that $f^*(a) = 0$ but $f(a) = 1$.
Since the only change in the circuit occurred at the gate computing $g$, by monotonicity, on
this tuple $g(a) = 1$ but $h(a) = 0$. It follows that $\pi(a) = 1$. Let $\pi'$ be a prime implicant of
$f$ for which $\pi'(a) = 1$. We show that $\pi' = \pi \wedge \pi_1$ for some monom $\pi_1$, in contradiction
to the condition of the lemma.

Let $x_i$ be any variable of $\pi$. Then $a_i = 1$ since $\pi(a) = 1$. Define the $n$-tuple $b$ by
$b_i = 0$ and $b_j = a_j$ for $j \neq i$. Since $b \leq a$ and $\pi(b) = 0$, $h$ and $g$ both have the same
value on $b$. Thus, both circuits compute the same value, which must be 0 by monotonicity
and the fact that $f^* = 0$ on $a$. Since $\pi'(a) = 1$ and $\pi'(b) = 0$ but only one variable was
changed, namely $x_i$, $\pi'$ must contain $x_i$. Since $x_i$ is an arbitrary variable of $\pi$, it follows
that $\pi'$ contains $\pi$ as a sub-monom. ■

This last result implies that if a function $f$ has no prime implicants containing more than
$l$ variables, then any monoms containing more than $l$ variables can be removed where they
are first created. This will be useful later when discussing Boolean convolution and Boolean
matrix multiplication, since each of their prime implicants depends on two variables.

BOOLEAN CONVOLUTION Convolution over commutative rings is defined in Section 6.7. In
this section we introduce the Boolean version, which is defined by a monotone multiple-output
function, and derive a lower bound of $n^{3/2}$ on its monotone circuit size. We also show that
over a complete basis Boolean convolution can be realized by a circuit of nearly linear size.

DEFINITION 9.6.3 The Boolean convolution function $f^{(n)}_{\mathrm{conv}} : \mathcal{B}^{2n} \mapsto \mathcal{B}^{2n-1}$ maps Boolean
$n$-tuples $a = (a_0, a_1, \ldots, a_{n-1})$ and $b = (b_0, b_1, \ldots, b_{n-1})$ onto a $(2n - 1)$-tuple $c$, denoted
$c = a \otimes b$, where $c_j$, $0 \leq j \leq 2n - 2$, is defined as

$$c_j = \bigvee_{r+s=j} a_r \wedge b_s$$
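The definition translates directly into a double loop. The following sketch (the function name is chosen here) computes $c = a \otimes b$ for bit tuples, with OR in place of addition and AND in place of multiplication.

```python
def boolean_convolution(a, b):
    """Boolean convolution of two n-tuples of bits, returning a (2n-1)-tuple.

    Output c_j is the OR, over all index pairs (r, s) with r + s = j,
    of a_r AND b_s.
    """
    n = len(a)
    if len(b) != n:
        raise ValueError("a and b must have the same length")
    c = [0] * (2 * n - 1)
    for r in range(n):
        for s in range(n):
            c[r + s] |= a[r] & b[s]
    return tuple(c)
```

For instance, with $a = (1, 0, 1)$ and $b = (0, 1, 0)$ only the products $a_0 \wedge b_1$ and $a_2 \wedge b_1$ are 1, so the result is $(0, 1, 0, 1, 0)$.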

Boolean convolution can be realized by a circuit over the standard basis $\Omega_0$ for multiplying
binary numbers (see Section 2.9) as follows. Represent $a$ and $b$ by the following integers where
$q = \lceil \log_2 n \rceil + 1$:

$$a = \sum_{i=0}^{n-1} a_i 2^{qi}, \qquad b = \sum_{j=0}^{n-1} b_j 2^{qj}$$

That is, each bit in $a$ and $b$ is separated by $\lceil \log_2 n \rceil$ zeros. The formal product of $a$ and $b$ is

$$ab = \sum_{k=0}^{2n-2} \left( \sum_{i+j=k} a_i b_j \right) 2^{qk}$$

Because no inner sum in the above expression is more than $2n - 1$, at most $q$ bits suffice to
represent it in binary notation. Consequently, there is no carry between any two inner sums.
It follows that an inner sum is non-zero if and only if $c_k = 1$. Thus, the value of $c_k$ can be
obtained by forming the OR of the bits in positions $kq, kq + 1, \ldots, kq + q - 1$ of the product.
Since two binary $m$-tuples can be multiplied in the standard binary notation by a circuit of
size $O(m(\log m)(\log \log m))$ (see Section 2.9.3), the function $f^{(n)}_{\mathrm{conv}}$ can be computed by a
circuit of size $O(n(\log^2 n)(\log \log n))$ since $m = nq = O(n \log n)$.
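The reduction to integer multiplication can be sketched with Python integers standing in for the multiplication circuit. All names below are illustrative. The code packs each bit into a field of $q = \lceil \log_2 n \rceil + 1$ bit positions, multiplies, and reports for each field whether it is non-zero; a direct implementation of the definition is included only for comparison.

```python
from itertools import product


def direct_convolution(a, b):
    """Reference implementation straight from the definition (OR of ANDs)."""
    n = len(a)
    c = [0] * (2 * n - 1)
    for r in range(n):
        for s in range(n):
            c[r + s] |= a[r] & b[s]
    return tuple(c)


def pack(bits, q):
    """Place bit i at position q*i, leaving q-1 zero positions between bits."""
    return sum(bit << (q * i) for i, bit in enumerate(bits))


def convolution_via_multiplication(a, b):
    """Boolean convolution obtained from one integer product."""
    n = len(a)
    q = (n - 1).bit_length() + 1          # ceil(log2 n) + 1
    prod = pack(a, q) * pack(b, q)
    mask = (1 << q) - 1
    # Field k holds the inner sum for c_k; it is non-zero exactly when c_k = 1.
    return tuple(int(((prod >> (q * k)) & mask) != 0) for k in range(2 * n - 1))


def agrees_for_all_inputs(n):
    """Exhaustively compare both implementations on all pairs of n-bit tuples."""
    return all(
        convolution_via_multiplication(a, b) == direct_convolution(a, b)
        for a in product((0, 1), repeat=n)
        for b in product((0, 1), repeat=n)
    )
```

The choice of $q$ matters: each inner sum is at most $n \leq 2^q - 1$, so a field never overflows into its neighbor and the fields can be read off independently.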

THEOREM 9.6.3 The circuit size of $f^{(n)}_{\mathrm{conv}} : \mathcal{B}^{2n} \mapsto \mathcal{B}^{2n-1}$ over the standard basis satisfies

$$C_{\Omega_0}\bigl(f^{(n)}_{\mathrm{conv}}\bigr) = O(n(\log^2 n)(\log \log n))$$

Our goal is to use the function replacement method to show that every monotone circuit
for Boolean convolution has size $\Omega(n^{3/2})$. As explained above, the method is designed to
use induction to prove lower bounds on monotone circuit size. Each replacement step removes
prime implicants from the function $g$ computed at some gate and changes the function $f$
computed by the circuit. If the new function $f^*$ is in the same family as $f$, the gate-replacement
process can continue and induction can be applied. Since the convolution function does not
necessarily change to another instance of itself on fewer variables, we place this function in the
class of semi-disjoint bilinear forms.

DEFINITION 9.6.4 Let $f^{(n,m,p)} = (f_1, f_2, \ldots, f_p)$, where each $f_r : \mathcal{B}^{n+m} \mapsto \mathcal{B}$, $1 \leq r \leq p$,
is a monotone function on $n$-tuple $x$ and $m$-tuple $y$; that is, $f_r(x, y) \in \mathcal{B}$. $f^{(n,m,p)}$ is a bilinear
form if each prime implicant of each $f_r$, $1 \leq r \leq p$, contains one variable of $x$ and one of $y$.
A function $f^{(n,m,p)}$ is a semi-disjoint bilinear form if in addition $PI(f_r) \cap PI(f_s) = \emptyset$ for
$r \neq s$ and each variable is contained in at most one prime implicant of any one function.

Before deriving a lower bound on the number of gates needed for a semi-disjoint bilinear
form, we introduce a new replacement rule peculiar to these forms.

LEMMA 9.6.2 No gate of a monotone circuit of minimal size for a semi-disjoint bilinear form
$f^{(n,m,p)}$ computes a function $g$ whose prime implicants include either two variables of $x$ or of $y$.

Proof We suppose that a minimal monotone circuit does contain a gate $g$ whose prime
implicants contain either two variables of $x$ or two of $y$ and show that a contradiction
results. Without loss of generality, assume that $PI(g)$ contains $x_i$ and $x_j$, $i \neq j$. If there is
a gate $g$ satisfying this hypothesis, there is one that is closest to an input variable. This must
be an OR gate because AND gates increase the length of prime implicants. Because the gate
in question is closest to inputs, at least one of $x_i$ and $x_j$ is either an input to this OR gate or
is the input to some OR gate that is on a path of OR gates to this gate. (See Fig. 9.12.)

A simple proof by induction on its circuit size demonstrates that if a circuit for $f^{(n,m,p)}
= (f_1, \ldots, f_p)$ contains a gate computing $g$ then $f_r$, $1 \leq r \leq p$, can be written as follows
(see Problem 9.36):

$$f_r(x, y) = (p_r(x, y) \wedge g(x, y)) \vee q_r(x, y) \qquad (9.1)$$

Here $p_r(x, y)$ and $q_r(x, y)$ are Boolean functions. Of course, if for no $r$ is $f_r$ a function of
$g$, then we can set $p_r(x, y) = 0$ and the circuit is not minimal.

If $f_r$ depends on $g$, $p_r(x, y) \neq 0$. However, $p_r(x, y) \neq 1$ because otherwise both
$x_i$ and $x_j$ are prime implicants of $f_r$, contradicting its definition. Also, $PI(p_r(x, y))$
cannot have any monoms containing one or more instances of a variable in $x$ or two or
more instances of variables in $y$ because when ANDed with $g$ they produce monoms that
could be removed by Rule (a) of Definition 9.6.2 and the circuit would not be minimal. It
follows that $PI(p_r(x, y))$ can contain only single variables of $y$. But this implies that for
some $k$, $y_k \wedge g \in I(f_r)$, which together with the fact that $x_i, x_j \in PI(g)$ implies that
$y_k \wedge x_i,\; y_k \wedge x_j \in I(f_r)$. But $y_k \wedge x_i$ and $y_k \wedge x_j$ cannot both be prime implicants of $f_r$
because they violate the requirement that no two prime implicants of $f_r$ contain the same
variable. It follows that $f_r$ does not depend on $g$. ■

Figure 9.12 If $PI(g)$ for a gate $g$ contains $x_i$ and $x_j$, then either $x_i$ or $x_j$ is input to an OR
gate on a path of OR gates to $g$.

The Boolean convolution function is a semi-disjoint bilinear form. Each implicant of
each component of $c = a \otimes b$ contains one variable of $a$ and one of $b$. In addition, the prime
implicants of $c_i$ and $c_j$ are disjoint if $i \neq j$. Finally, each variable appears in only one implicant
of a component function, although it may appear in more than one such function.
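These three properties can be confirmed for small $n$. In the sketch below, a prime implicant $a_r \wedge b_s$ is encoded as the index pair `(r, s)`, so every implicant has one variable of each kind by construction; this encoding and the function names are assumptions made for illustration. The check covers the remaining two conditions of a semi-disjoint bilinear form.

```python
def convolution_prime_implicants(n):
    """PI(c_j) for Boolean convolution, with a_r AND b_s encoded as the pair (r, s)."""
    return {
        j: {(r, j - r) for r in range(n) if 0 <= j - r < n}
        for j in range(2 * n - 1)
    }


def is_semi_disjoint(pi):
    """Check disjointness across outputs and single use of each variable per output."""
    outputs = list(pi)
    for index, j in enumerate(outputs):
        a_indices = [r for r, _ in pi[j]]
        b_indices = [s for _, s in pi[j]]
        # Each variable may occur in at most one prime implicant of one function.
        if len(set(a_indices)) != len(a_indices):
            return False
        if len(set(b_indices)) != len(b_indices):
            return False
        # Prime implicants of different outputs must not coincide.
        for other in outputs[index + 1:]:
            if pi[j] & pi[other]:
                return False
    return True
```

A variable such as $a_0$ still occurs in several different outputs ($c_0, c_1, \ldots, c_{n-1}$), which the definition permits.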

THEOREM 9.6.4 Let $f^{(n,m,p)} : \mathcal{B}^{n+m} \mapsto \mathcal{B}^p$, $f^{(n,m,p)} = (f_1, f_2, \ldots, f_p)$, be a semi-disjoint
bilinear form, where $f_r(x, y) \in \mathcal{B}$. Let $d_i$ be the number of functions in $\{f_1, f_2, \ldots, f_p\}$ that
are essentially dependent on the input variable $x_i$, $1 \leq i \leq n$. Then the monotone circuit size of
$f^{(n,m,p)}$ must satisfy the following lower bound:

$$C_{\Omega_{\mathrm{mon}}}\bigl(f^{(n,m,p)}\bigr) \geq \sum_{i=1}^{n} \sqrt{d_i}$$
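For Boolean convolution the quantities $d_i$ are easy to count. The sketch below assumes that the bound of the theorem is $\sum_{i=1}^{n} \sqrt{d_i}$, the form used in the proof, and takes the $x$ variables to be $a_0, \ldots, a_{n-1}$; the function names are illustrative.

```python
import math


def dependence_counts(n):
    """d_i for each a_i: the number of outputs c_j with a prime implicant containing a_i.

    Output c_j contains a_i AND b_{j-i} exactly when 0 <= j - i < n.
    """
    return [sum(1 for j in range(2 * n - 1) if 0 <= j - i < n) for i in range(n)]


def lower_bound(n):
    """Sum of sqrt(d_i) over the a-variables of Boolean convolution."""
    return sum(math.sqrt(d) for d in dependence_counts(n))
```

Every $a_i$ occurs in exactly $n$ outputs, so the sum equals $n \sqrt{n} = n^{3/2}$, matching the lower bound announced for Boolean convolution.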

Proof The proof is by induction. The basis for induction is the semi-disjoint bilinear form
on two variables $f^{(1,1,1)}(x, y) = x \wedge y$. In this case $d_1 = 1$ and $C_{\Omega_{\mathrm{mon}}}\bigl(f^{(1,1,1)}\bigr) = 1$.
We assume that any semi-disjoint bilinear form in $n + m - 1$ or fewer variables satisfies the
lower bound. We show that setting $x_i = 0$ produces another function that is a semi-disjoint
bilinear form and allows the removal of at least $\sqrt{d_i}$ gates. The lower bound follows by
induction. We consider only minimal circuits.

Let $u_i$ denote the number of functions in $\{f_1, f_2, \ldots, f_p\}$ that are essentially dependent
on $x_i$ and have a single prime implicant (such as $c_0 = a_0 \wedge b_0$ and $c_{2n-2} = a_{n-1} \wedge b_{n-1}$
for convolution). Setting $x_i = 0$ eliminates the $u_i$ AND gates at which these outputs
are computed. We show that at least $\sqrt{d_i - u_i}$ OR gates can also be eliminated. Since
$u_i + \sqrt{d_i - u_i} \geq \sqrt{d_i}$ (see Problem 9.8), we have the desired conclusion.

Let $V_i$ denote those outputs that depend on $x_i$ whose associated function has at least
two prime implicants. Then $|V_i| = d_i - u_i$. There must be at least one OR gate on each
path $P$ from $x_i$ to $f_r \in V_i$ because, if not, each path contains only ANDs and $f_r$ has only
one prime implicant that contains $x_i$, in contradiction to the definition of $V_i$.

We claim that on each path $P$ from an input labeled $x_i$ to some $f_r \in V_i$ there is an
OR gate computing a function $g_t$ such that $x_{i_t} \wedge y_{j_t} \in PI(g_t)$ for some $x_{i_t} \neq x_i$. Let
$E_i = \{g_t\}$ be those OR gates closest to an input vertex $x_i$. Call $E_i$ the bottleneck for
variable $x_i$. We shall show that $|E_i| \geq \sqrt{d_i - u_i}$ and that each of the gates in $E_i$ can be
eliminated by setting $x_i = 0$.

If the claim is false, then there is a path $P$ from input $x_i$ to output $f_r \in V_i$ such that
for each OR gate (let it compute $g_t$) on $P$ there is no $x_{i_t} \ne x_i$ such that $x_{i_t} \wedge y_{j_t} \in
PI(g_t)$. Therefore, either all monoms of $PI(g_t)$ a) contain $x_i$ or b) are monoms that are
not implicants of an output (they are not of the form $x_{i_t} \wedge y_{j_t}$). In case a), setting $x_i = 0$
causes the OR gates on $P$ to have value 0, which forces the AND gates on $P$ and $f_r$ to
have value 0, contradicting the definition of $f_r$ (it has at least two prime implicants). In the
second case under Rule (a) the monoms not containing $x_i$ can be removed without changing
the functions computed. Thus, when $x_i = 0$, the output of each OR gate on $P$ has value 0,
which contradicts the definition of $f_r$ since it contains at least two prime implicants.

We now show that $|E_i| \ge \sqrt{d_i - u_i}$. Since each of the OR gates in $E_i$ has a prime
implicant $x_{i_t} \wedge y_{j_t}$ not containing $x_i$, their outputs can be set to 1 by setting $x_{i_t} = y_{j_t} = 1$
for $1 \le t \le |E_i|$. This eliminates all dependence of $f_r \in V_i$ on $x_i$. However, since inputs
have only been assigned value 1 (and not 0), this dependence on $x_i$ can be eliminated only
if all functions in $V_i$ have value 1; that is, at least one prime implicant of each of them is
set to 1 by this assignment. Since each variable appears in at most one prime implicant of
a function, the number of different variables $x_{i_t}$ (and $y_{j_t}$) that are set to 1 is at most $|E_i|$.
Thus, at most $|E_i|^2$ prime implicants can be assigned value 1 by this assignment. Thus, if
$|E_i|^2 < (d_i - u_i)$, we have a contradiction since $|V_i| = (d_i - u_i)$.

We now show that $|E_i|$ OR gates can be eliminated by setting $x_i = 0$. Since each gate is
a closest gate to an input labeled $x_i$ with the stated property, there is an OR gate on the path
to it with $x_i$ as an input. Thus, setting $x_i = 0$ eliminates one of the two inputs to the OR
gate and the need for the gate itself. ■

Since for each of the $n$ input variables in $a$ there are $n$ output functions in $c = a \otimes b$ that
depend on it ($d_i = n$ for $1 \le i \le n$), the following corollary is immediate.

COROLLARY 9.6.1 Let $f^{(n)}_{\mathrm{conv}} : \mathcal{B}^{2n} \mapsto \mathcal{B}^{2n-1}$ be the Boolean convolution function. Then the
monotone circuit size of $f^{(n)}_{\mathrm{conv}}$ satisfies the following lower bound:

$$C_{\Omega_{\mathrm{mon}}}\left(f^{(n)}_{\mathrm{conv}}\right) \ge n^{3/2}$$

Unfortunately, no upper bound on the monotone circuit size of $f^{(n)}_{\mathrm{conv}}$ is known that
matches this lower bound. A stronger statement can be made for Boolean matrix multiplication.
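The step from Theorem 9.6.4 to the corollary can be checked numerically. The sketch below assumes the theorem's bound is the sum of $\sqrt{d_i}$ over the variables of $a$, and it assumes the convolution outputs are $c_k = \bigvee_{i+j=k} a_i \wedge b_j$ for $0 \le k \le 2n-2$. The function names are illustrative, not taken from the text.

```python
import math

def conv_dependency_counts(n):
    """d[i] = number of outputs c_k = OR_{i+j=k} (a_i AND b_j) that depend on a_i."""
    d = [0] * n
    for k in range(2 * n - 1):        # outputs c_0 .. c_{2n-2}
        for i in range(n):
            j = k - i
            if 0 <= j < n:            # a_i AND b_j is a prime implicant of c_k
                d[i] += 1
    return d

def sqrt_sum_bound(d):
    """Sum of sqrt(d_i), the per-variable count assumed for Theorem 9.6.4."""
    return sum(math.sqrt(di) for di in d)

d = conv_dependency_counts(4)
bound = sqrt_sum_bound(d)
```

Each $a_i$ reaches exactly $n$ outputs, so the sum collapses to $n \cdot \sqrt{n} = n^{3/2}$.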

BOOLEAN MATRIX MULTIPLICATION Matrix multiplication over rings is discussed at length in
Section 6.3. In this section we introduce the Boolean version. An $I \times J$ matrix $A = [a_{i,j}]$,
$1 \le i \le I$ and $1 \le j \le J$, is a two-dimensional array of elements in which $a_{i,j}$ is the element
in the $i$th row and $j$th column. We take the entries in a matrix to be Boolean variables.

DEFINITION 9.6.5 Let $A = [a_{i,k}]$, $1 \le i \le n$ and $1 \le k \le m$, $B = [b_{k,j}]$, $1 \le k \le m$ and
$1 \le j \le p$, and $C = [c_{i,j}]$, $1 \le i \le n$ and $1 \le j \le p$, be $n \times m$, $m \times p$, and $n \times p$ matrices,
respectively. The product $C = A \times B$ of $A$ and $B$ is the function $f^{(n,m,p)}_{MM} : \mathcal{B}^{nm+mp} \mapsto \mathcal{B}^{np}$
whose value on the matrices $A$ and $B$ is the matrix $C$ whose entry in row $i$ and column $j$, $c_{i,j}$, is
defined as

$$c_{i,j} = \bigvee_{k=1}^{m} a_{i,k} \wedge b_{k,j}$$

In a more general context the AND operator $\wedge$ and the OR operator $\vee$ are replaced by the
multiplication and addition operators over rings.

The above definition can be used as an algorithm to compute $c_{i,j}$, $1 \le i \le n$ and $1 \le
j \le p$, from the entries in matrices $A$ and $B$. We call this the standard matrix-multiplication
algorithm. It uses $nmp$ ANDs and $n(m-1)p$ ORs. We now show that every monotone circuit
for matrix multiplication requires at least this many ANDs and ORs.
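The operation counts of the standard algorithm can be confirmed by direct simulation. The following is a minimal sketch, assuming matrices are given as lists of 0/1 rows; the function name and the counters are illustrative.

```python
def boolean_matmul(A, B):
    """Standard Boolean matrix product C = A x B, counting AND and OR operations."""
    n, m, p = len(A), len(B), len(B[0])
    ands = ors = 0
    C = [[0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            acc = A[i][0] & B[0][j]          # first product term of c_{i,j}
            ands += 1
            for k in range(1, m):
                acc |= A[i][k] & B[k][j]     # one more AND and one more OR
                ands += 1
                ors += 1
            C[i][j] = acc
    return C, ands, ors

A = [[1, 0, 1],
     [0, 0, 1]]
B = [[0, 1],
     [1, 1],
     [0, 0]]
C, ands, ors = boolean_matmul(A, B)
```

With $n = 2$, $m = 3$, $p = 2$ the counters come out to $nmp = 12$ ANDs and $n(m-1)p = 8$ ORs, independent of the entries of the matrices.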

Clearly  the  matrix  multiplication  function  is  a  bilinear  form.  We  associate  the  entries  in 
A  with  the  tuple  x  and  those  in  B  with  y.  We  strengthen  Theorem  9.6.4  to  obtain  a  lower 
bound on the number of ORs needed to realize it in a monotone circuit.

LEMMA 9.6.3 Every monotone circuit for Boolean matrix multiplication requires at least
$n(m-1)p$ OR gates.

Proof In the proof of Theorem 9.6.4 we identified a set $E_i$ of gates called the bottleneck
associated with each input variable $x_i$. We demonstrated that each of these gates can be
eliminated by setting $x_i = 0$ and that $E_i$ has at least $\sqrt{d_i - u_i}$ gates, where $d_i - u_i = |V_i|$
is the number of circuit outputs that depend essentially on $x_i$ and have at least two prime
implicants. These results were shown by proving that all gates in $E_i$ are OR gates and that
the $t$th of these gates' associated function contains a prime implicant of the form $x_{i_t} \wedge y_{j_t}$
for $x_{i_t} \ne x_i$. We then demonstrated that the dependence of the outputs in $V_i$ on the input
$x_i$ can be eliminated by setting $x_{i_t} = y_{j_t} = 1$ for $1 \le t \le |E_i|$ but that this contradicts
the definition of a semi-disjoint bilinear form if $|E_i|^2 < |V_i|$. Finally, we proved that by
setting $x_i = 0$ each of the gates in $E_i$ could be eliminated. For this lemma, we need only
strengthen the lower bound on $|E_i|$ for matrix multiplication.

Consider a minimal circuit. The proof is by induction on $m$, with the base case being
$m = 1$. In the base case $c_{i,j} = a_{i,1} \wedge b_{1,j}$ for $1 \le i \le n$ and $1 \le j \le p$ and no ORs
are needed. As inductive hypothesis we assume that $f^{(n,m-1,p)}_{MM}$ requires at least $n(m-2)p$
OR gates. We show that setting any column of $A$ in $f^{(n,m,p)}_{MM}$ to 0 eliminates $np$ OR gates
and reduces the problem to an instance of $f^{(n,m-1,p)}_{MM}$. It follows that $f^{(n,m,p)}_{MM}$ requires
$n(m-1)p$ OR gates.

When $m \ge 2$, each output function $c_{i,j}$ has at least two prime implicants. We apply
the bottleneck argument to this case. Consider the bottleneck $E_{i,k}$ associated with input
variable $a_{i,k}$. We show that $|E_{i,k}| \ge p$, from which it follows that at least $p$ OR gates can be
eliminated by setting $a_{i,k} = 0$. This reduces the problem to another set of bilinear forms.
Repeating this for $1 \le i \le n$, we eliminate $np$ OR gates, one column of $A$, and one row of
$B$. Let $V_{i,k} = \{c_{i,j} \mid 1 \le j \le p\}$ be the outputs that depend on $a_{i,k}$.

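To show that $|E_{i,k}| \ge p$, let the $t$th gate of $E_{i,k}$ compute a function with prime implicant
$x_{i_t} \wedge y_{j_t}$ for $x_{i_t} \ne a_{i,k}$. Here $x_{i_t} = a_{i_t,k_t}$ and $y_{j_t} = b_{l_t,j_t}$ for some $i_t$, $k_t$, $l_t$, and $j_t$.
If we set all entries in $\{a_{i_t,k_t} \mid 1 \le t \le |E_{i,k}|\}$ and $\{b_{l_t,j_t} \mid 1 \le t \le |E_{i,k}|\}$ to 1, we
eliminate all dependence of outputs in $V_{i,k}$ on $a_{i,k}$. However, since $|V_{i,k}| = p$, the set
$\{b_{l_t,j_t}\}$ must contain at least one variable used in $c_{i,j}$ for each $1 \le j \le p$. Thus, $|E_{i,k}| \ge p$. ■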

We  now  derive  a  lower  bound  on  the  number  of  AND  gates  needed  for  Boolean  matrix 
multiplication. 

LEMMA 9.6.4 Every monotone circuit for Boolean matrix multiplication requires at least
$nmp$ AND gates.

Proof Consider a minimal circuit. The proof is by induction on $m$, the base case being
$m = 1$. In the base case $c_{i,j} = a_{i,1} \wedge b_{1,j}$ for $1 \le i \le n$ and $1 \le j \le p$ and $np$ ANDs are
needed, since $np$ results must be computed, each requiring one AND, and all functions are
different. As inductive hypothesis we assume that $f^{(n,m-1,p)}_{MM}$ requires at least $n(m-1)p$
AND gates. We show that setting any column of $A$ in $f^{(n,m,p)}_{MM}$ to 1 and the corresponding
row of $B$ to 0 eliminates $np$ AND gates and reduces the problem to an instance of $f^{(n,m-1,p)}_{MM}$.
It follows that $f^{(n,m,p)}_{MM}$ requires $nmp$ AND gates.

For arbitrary $1 \le k \le m$ let $G_{i,j}$ be a gate closest to inputs computing a function $g$
such that $PI(g)$ contains $a_{i,k} \wedge b_{k,j}$. Since the gate associated with $c_{i,j}$ has $a_{i,k} \wedge b_{k,j}$ as a
prime implicant, there is such a gate $G_{i,j}$. Furthermore, $G_{i,j}$ must be an AND gate because
OR gates cannot generate new prime implicants. Let $G_1$ and $G_2$ be gates generating inputs
for $G_{i,j}$. Let them compute functions $g_1$ and $g_2$. It follows from the definition of $G_{i,j}$ that
$a_{i,k} \in PI(g_1)$ and $b_{k,j} \in PI(g_2)$ or vice versa. Let the former hold. If $a_{i,k} = 1$, $g_1 = 1$
and $G_{i,j}$ can be eliminated. We now show that $G_{i,j} \ne G_{i',j'}$ for $(i,j) \ne (i',j')$. Suppose
not. Since $i \ne i'$ or $j \ne j'$, there are at least three distinct variables among $a_{i,k}$, $a_{i',k}$, $b_{k,j}$,
and $b_{k,j'}$. Therefore either $g_1$ or $g_2$ has at least two of these variables as prime implicants.
By Lemma 9.6.2 this circuit is not minimal, a contradiction. ■


We  summarize  the  results  of  this  section  below. 

THEOREM 9.6.5 The standard algorithm for $f^{(n,m,p)}_{MM} : \mathcal{B}^{nm+mp} \mapsto \mathcal{B}^{np}$, the Boolean matrix
multiplication function, is optimal. It uses $nmp$ ANDs and $n(m-1)p$ ORs.

We  now  show  that  the  monotone  circuit  size  of  the  clique  function  is  exponential. 


9.6.3  The  Approximation  Method 

The  approximation  method  is  used  to  derive  large  lower  bounds  on  the  monotone  circuit  size 
for  certain  monotone  Boolean  functions.  In  this  section  we  use  it  to  derive  an  exponential 
lower bound on the size of the smallest monotone circuit for the clique function $f^{(n)}_{\mathrm{clique},k} :
\mathcal{B}^{n(n-1)/2} \mapsto \mathcal{B}$. This method provides an interesting approach to deriving large lower bounds
on  circuit  size.  However,  as  mentioned  in  the  Chapter  Notes,  it  is  doubtful  that  it  can  be  used 
to  obtain  large  lower  bounds  on  circuit  size  over  complete  bases. 

The approximation method converts a monotone circuit $C$ computing a function $f$ into
an approximation circuit $\hat{C}$ computing a function $\hat{f}$. This is done by repeatedly replacing a
previously unvisited gate farthest away from the output gate by an approximation gate that
computes an approximation to the AND or OR gate it replaces. Each replacement operation
changes the circuit and increases by a small amount the number of input tuples on which $f$
and the function computed by the new circuit differ. When the entire replacement process is
complete, the resulting circuit approximates $f$ poorly; that is, $f$ and $\hat{f}$ differ on a large number
of inputs. For this to happen, the original monotone circuit must have had many gates, each
of which contributes a relatively small number of errors to the complete replacement process.
This is the essence of the approximation method.

There  are  a  number  of  ways  to  approximate  AND  and  OR  gates  in  a  monotone  circuit. 
Razborov  [270],  who  introduced  the  approximation  method,  used  an  approximation  for  gates 
based  on  clique  indicators,  monotone  functions  associated  with  a  subset  of  a  set  of  vertices 
that  has  value  1  exactly  when  there  is  an  edge  between  every  pair  of  vertices  in  the  subset.  In 
this  section  gates  are  approximated  in  terms  of  the  SOPE  and  POSE  forms,  a  method  used  by 
Amano  and  Maruoka  [20]  to  approximate  the  clique  function. 

It is not hard to show that the monotone circuit size of $f^{(n)}_{\mathrm{clique},k}$ is $O(n^k)$. (See Problem 9.37.)
We now show that all monotone circuits for $f^{(n)}_{\mathrm{clique},k}$ have size
$C_{\Omega_{\mathrm{mon}}}(f^{(n)}_{\mathrm{clique},k}) \ge \frac{1}{2}(1.8)^{\min(\sqrt{k-1}/2,\; n/(2k))}$, which is $2^{\Omega(n^{1/3})}$ for $k$ proportional to $n^{2/3}$.

TEST CASES The quality of an approximation to the clique function $f^{(n)}_{\mathrm{clique},k}$ is determined
by providing positive and negative test inputs. A $k$-positive test input is a binary $n(n-1)/2$-tuple
that describes a graph containing a single $k$-clique.

The negative test inputs, defined below, describe graphs that have many edges but not
quite enough to contain a $k$-clique. A special set of negative test inputs is associated with
balanced partitions of the vertices of an $n$-vertex graph $G = (V, E)$. A $(k-1)$-balanced
partition of $V = \{v_1, \ldots, v_n\}$ is a collection of $k-1$ disjoint sets, $V_1, V_2, \ldots, V_{k-1}$, such
that each set contains either $\lceil n/(k-1) \rceil$ or $\lfloor n/(k-1) \rfloor$ elements. (By Problem 9.5 there are
$w = n \bmod (k-1)$ sets of the first kind and $k-1-w$ sets of the second kind.) The graph
associated with a particular $(k-1)$-balanced partition has an edge between each pair of vertices
in different sets and no other edges. For each $(k-1)$-balanced partition, a $k$-negative test
input is a binary $n(n-1)/2$-tuple $x$ describing the graph $G$ associated with that partition.
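The two kinds of test inputs can be built explicitly for small graphs. The sketch below assumes vertices are numbered $1, \ldots, n$ and a graph is represented by its set of edges; the helper names are hypothetical, and the brute-force clique test is only practical for small $n$.

```python
from itertools import combinations

def positive_test_graph(clique):
    """Edges of a k-positive test graph: exactly the edges among the vertices in `clique`."""
    return {frozenset(e) for e in combinations(sorted(clique), 2)}

def balanced_partition(n, k):
    """One (k-1)-balanced partition of 1..n, with the w = n mod (k-1) larger sets first."""
    blocks, w = k - 1, n % (k - 1)
    sizes = [n // blocks + 1] * w + [n // blocks] * (blocks - w)
    parts, start = [], 1
    for s in sizes:
        parts.append(set(range(start, start + s)))
        start += s
    return parts

def negative_test_graph(parts):
    """Edges of a k-negative test graph: one edge per pair of vertices in different sets."""
    block_of = {v: b for b, part in enumerate(parts) for v in part}
    return {frozenset((u, v)) for u, v in combinations(sorted(block_of), 2)
            if block_of[u] != block_of[v]}

def has_clique(edges, n, k):
    """Brute-force test for a k-clique among vertices 1..n."""
    return any(all(frozenset(e) in edges for e in combinations(c, 2))
               for c in combinations(range(1, n + 1), k))

parts = balanced_partition(10, 4)      # sets of sizes 4, 3, 3
neg = negative_test_graph(parts)
pos = positive_test_graph({1, 2, 3, 4})
```

A negative test graph has no $k$-clique because any $k$ vertices must place two of them in the same one of the $k-1$ sets, and those two are not adjacent.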

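LEMMA 9.6.5 There are $\tau_+$ $k$-positive test inputs, where

$$\tau_+ = \binom{n}{k} = \frac{n!}{k!\,(n-k)!}$$

and $\tau_-$ $k$-negative test inputs, where for $w = n \bmod (k-1)$

$$\tau_- = \frac{n!}{\left(\lceil n/(k-1) \rceil !\right)^{w} \left(\lfloor n/(k-1) \rfloor !\right)^{k-1-w}\, w!\,(k-1-w)!}$$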

Proof It is well known that $\tau_+ = \binom{n}{k}$. To derive the expression for $\tau_-$ we index each
element of each set in a $(k-1)$-balanced partition. Such a partition has $w = n \bmod (k-1)$
sets containing $\lceil n/(k-1) \rceil$ elements and $k-1-w$ sets containing $\lfloor n/(k-1) \rfloor$ elements.
The elements in the first $w$ sets are indexed by the pairs $\{(i,1), (i,2), \ldots, (i, \lceil n/(k-1) \rceil)\}$
for $1 \le i \le w$. Those in the remaining $k-1-w$ sets are indexed by the pairs
$\{(i,1), (i,2), \ldots, (i, \lfloor n/(k-1) \rfloor)\}$ for $w+1 \le i \le k-1$. (See Fig. 9.13.) Let $\mathcal{P}$
be the set of all such pairs. To define a $k$-negative graph, we assign each vertex in the set
$V = \{1, 2, \ldots, n\}$ to a unique pair. This partitions the vertices into $k-1$ sets. If vertices
$v_a$ and $v_b$ are in the same set, the edge variable $x_{a,b} = 0$; otherwise $x_{a,b} = 1$. These
assignments define the edges in a graph $G = (V, E)$. There are $n!$ assignments of vertices
to pairs. Of these, there are $(\lceil n/(k-1) \rceil !)^{w} (\lfloor n/(k-1) \rfloor !)^{k-1-w}\, w!\,(k-1-w)!$ that
correspond to each graph. To see this, observe that there are $\lceil n/(k-1) \rceil !$ ways to permute
the elements in each of the first $w$ sets and $\lfloor n/(k-1) \rfloor !$ ways to permute the elements in
each of the remaining $k-1-w$ sets. Also, each of the first $w$ (the last $k-1-w$) sets has
the same size, and these sets can be ordered in any of $w!$ ($(k-1-w)!$) ways without changing the
graph. ■

Figure 9.13 A set of pairs $\mathcal{P}$ indexing the elements of sets in a $(k-1)$-balanced partition of
a set $V$ of $n$ vertices. In this example $n = 10$ and $k = 4$ and the partition has three sets, $V_1$,
$V_2$, and $V_3$, containing four, three, and three elements, respectively. Shown are two assignments of
vertices to pairs in $\mathcal{P}$ that correspond to the same partition of $V$.
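The counting argument can be cross-checked by brute force on small cases. The sketch below computes the number of $k$-negative test inputs as $n!$ divided by the number of assignments per graph described in the proof, and compares it with a direct enumeration of balanced partitions. The function names are illustrative, and the enumeration is exponential in $n$, so it is meant only for tiny instances with $n \ge k-1$.

```python
from itertools import product
from math import comb, factorial

def tau_plus(n, k):
    """Number of k-positive test inputs: one per k-subset of the n vertices."""
    return comb(n, k)

def tau_minus(n, k):
    """Number of k-negative test inputs: n! divided by the assignments per partition."""
    blocks, w = k - 1, n % (k - 1)
    hi, lo = -(-n // blocks), n // blocks          # ceiling and floor of n/(k-1)
    per_graph = (factorial(hi) ** w * factorial(lo) ** (blocks - w)
                 * factorial(w) * factorial(blocks - w))
    return factorial(n) // per_graph

def count_balanced_partitions(n, k):
    """Enumerate every labeling of vertices by set and count distinct balanced partitions."""
    blocks, w = k - 1, n % (k - 1)
    target = sorted([n // blocks + 1] * w + [n // blocks] * (blocks - w))
    seen = set()
    for labels in product(range(blocks), repeat=n):
        parts = [frozenset(v for v in range(n) if labels[v] == b) for b in range(blocks)]
        if sorted(len(part) for part in parts) == target:
            seen.add(frozenset(parts))
    return len(seen)
```

For the example of Fig. 9.13 ($n = 10$, $k = 4$) the formula gives $10!/(4!\,(3!)^2\,1!\,2!) = 2100$ negative test inputs.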

APPROXIMATOR  CIRCUITS  It  simplifies  the  development  of  lower  bounds  to  assume  that  each 
AND  gate  in  a  circuit  is  followed  by  an  OR  gate  and  vice  versa  and  that  the  output  gate  is  an 
AND  gate.  This  requirement  can  be  met  by  interposing  between  successive  AND  (OR)  gates 
an  OR  (AND)  gate  both  of  whose  inputs  are  connected  together.  Since  this  transformation  at 
most  triples  the  number  of  gates,  an  exponential  lower  bound  on  the  size  of  the  transformed 
circuit  yields  an  exponential  lower  bound  on  the  size  of  the  original  circuit. 

A monotone circuit for $f^{(n)}_{\mathrm{clique},k}$ has (edge) variables drawn from the set $\{x_{i,j} \mid 1 \le i <
j \le n\}$. The approximation to an input variable $x_{i,j}$ is $x_{i,j}$ itself. Gates in a circuit are successively
replaced by approximator circuits starting with a gate that is at greatest distance from the
root (output vertex) and continuing with previously unvisited gates at greatest distance from
the root. Thus, when an AND or OR gate is replaced, its inputs have previously been replaced
by functions $f_l$ and $f_r$ that approximate the functions $g_l$ and $g_r$ computed in the original
circuit.

Approximations to AND ($\wedge$) and OR ($\vee$) gates are denoted $\hat{\wedge}$ and $\hat{\vee}$, respectively. As seen
below, the approximation given to a gate is context dependent. Approximations are defined
in terms of endpoint sets. Given a set of edge variables, for example $\{x_{1,2}, x_{1,3}, x_{2,3}, x_{1,4}\}$, its
associated endpoint set is the set of vertex indices used to define the edge variables, which is
$\{1, 2, 3, 4\}$ in this example. Given a term $t$ (a product (AND) or sum (OR) of edge variables),
the endpoint set associated with it, $E(t)$, is the endpoint set of the edge variables appearing in
the term. For example, if $t = x_{1,2} \wedge x_{1,3} \wedge x_{2,3} \wedge x_{1,4}$ or $t = x_{1,2} \vee x_{1,3} \vee x_{2,3} \vee x_{1,4}$, then
$E(t) = \{1, 2, 3, 4\}$. The endpoint size of a term $t$, denoted $|E(t)|$, is the number of indices
in $E(t)$.
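As a small illustration of the definition, a term can be represented by the list of vertex pairs that index its edge variables; this representation and the function name are assumptions made for the sketch, not notation from the text.

```python
def endpoint_set(term):
    """Endpoint set E(t) of a term given as an iterable of edges (a, b)."""
    return {v for edge in term for v in edge}

# The example term built from x_{1,2}, x_{1,3}, x_{2,3}, x_{1,4}.
t = [(1, 2), (1, 3), (2, 3), (1, 4)]
E_t = endpoint_set(t)
endpoint_size = len(E_t)
```

The same endpoint set results whether the edge variables are combined by AND or by OR, since only the indices matter.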

Consider a gate to be approximated. Let its two inputs be from gates that compute functions
$f_l$ and $f_r$. Like any function, $f_r$ and $f_l$ can be represented in either the monotone SOPE
or POSE form. (All SOPEs and POSEs in this section are monotone.) The approximation
rules for AND and OR gates are described below and denoted $\hat{\wedge}$ and $\hat{\vee}$, respectively. Here we
let $p = \lfloor \sqrt{k-1}/2 \rfloor$ and $q = \lfloor n/(4k) \rfloor$.

$\hat{\wedge}$: The approximation $f_l \hat{\wedge} f_r$ to $f_l \wedge f_r$ is obtained by representing $f_l \wedge f_r$ in the sum-of-
products expansion (SOPE) and eliminating all product terms whose endpoint set contains
more than $p$ vertices. It follows that $f_l \wedge f_r \ge f_l \hat{\wedge} f_r$.

$\hat{\vee}$: The approximation $f_l \hat{\vee} f_r$ to $f_l \vee f_r$ is obtained by representing $f_l \vee f_r$ in the product-of-
sums expansion (POSE) and eliminating all sum terms whose endpoint set contains more
than $q$ vertices. It follows that $f_l \vee f_r \le f_l \hat{\vee} f_r$.
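The AND rule can be sketched on small inputs. The code below assumes a monotone SOPE is stored as a set of product terms, each term a frozenset of edges `(a, b)`; this representation, the helper names, and the example functions are illustrative choices. Only the AND rule is shown, because the OR rule additionally needs a conversion to POSE.

```python
from itertools import combinations, product

def term(*edges):
    """A product term: the AND of the listed edge variables."""
    return frozenset(edges)

def endpoints(t):
    return {v for edge in t for v in edge}

def sope_and(f, g):
    """SOPE of f AND g: the product of every term of f with every term of g."""
    return {s | t for s in f for t in g}

def approx_and(f, g, p):
    """The AND rule: drop every product whose endpoint set has more than p vertices."""
    return {t for t in sope_and(f, g) if len(endpoints(t)) <= p}

def evaluate(sope, edges):
    """Value of a monotone SOPE on the graph whose edge set is `edges`."""
    return any(t <= edges for t in sope)

def never_overshoots(approx, exact, n):
    """Check that approx <= exact on every graph with vertices 1..n."""
    all_edges = list(combinations(range(1, n + 1), 2))
    for bits in product([False, True], repeat=len(all_edges)):
        edges = {e for e, b in zip(all_edges, bits) if b}
        if evaluate(approx, edges) and not evaluate(exact, edges):
            return False
    return True

f = {term((1, 2)), term((1, 3))}           # x_{1,2} OR x_{1,3}
g = {term((2, 3)), term((3, 4))}           # x_{2,3} OR x_{3,4}
exact = sope_and(f, g)
approx = approx_and(f, g, p=3)             # drops x_{1,2} AND x_{3,4}
```

On the graph with edges $\{1,2\}$ and $\{3,4\}$ only, the exact AND has value 1 while the approximation has value 0, which is the kind of error on which the approximate AND can only undershoot.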

Since $f_l \wedge f_r \ge f_l \hat{\wedge} f_r$ and $f_l \vee f_r \le f_l \hat{\vee} f_r$, if a positive test input $x$ causes the output
of the approximated circuit to have value 0 when it should have value 1, then there is an
approximated AND gate (including the output gate) that has value 0 on $x$ when it should have
value 1. Similarly, if there is a negative test input $x$ that causes the approximated output to be
1 when it should be 0, there is an approximated OR gate that has value 1 on $x$ when it should
have value 0. We now examine the performance of approximator circuits.

PERFORMANCE OF APPROXIMATOR CIRCUITS We now show that when the approximation process
is complete, the approximation circuit for $f^{(n)}_{\mathrm{clique},k}$ makes a very large number of errors
but that each gate approximation introduces a small number of errors. Thus, many gates must
have been approximated to produce the large number of errors made by the fully approximated
circuit. In fact, we show that the approximating circuit for $f^{(n)}_{\mathrm{clique},k}$ either has output identically
0, thereby making one error on each of the $\tau_+ = \binom{n}{k}$ positive test inputs (it produces 0
when it should produce 1), or makes at least $\tau_-/2$ errors on the $\tau_-$ negative test inputs (it produces
1 when it should produce 0). On the other hand, we also show that approximating one AND
or OR gate causes a small number of errors, at most $e_{\mathrm{AND}}$ errors per AND gate on positive
test inputs and at most $e_{\mathrm{OR}}$ errors per OR gate on negative test inputs, quantities for which
upper bounds are derived below. It follows that the original circuit for $f^{(n)}_{\mathrm{clique},k}$ has at
least $\tau_+/e_{\mathrm{AND}}$ AND gates or at least $\tau_-/(2e_{\mathrm{OR}})$ OR gates. Since we do not know which of the
two cases holds, the lower bound on the monotone circuit size of $f^{(n)}_{\mathrm{clique},k}$ is the smaller of
these two lower bounds.
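The two gate-count bounds can be evaluated exactly once the closed forms of Lemmas 9.6.7 and 9.6.8 are substituted, because the factor $k!\,(n-k)!$ cancels against $\tau_+$ and the partition-multiplicity factor cancels against $\tau_-$. The sketch below takes $p$ and $q$ as explicit parameters rather than computing them from $n$ and $k$; the function names are illustrative.

```python
from fractions import Fraction
from math import factorial

def and_gate_bound(n, p):
    """tau_+ / e_AND = n! / ((n/2)^(p+1) * (n-p-1)!)."""
    return Fraction(factorial(n), factorial(n - p - 1)) / Fraction(n, 2) ** (p + 1)

def or_gate_bound(n, q):
    """tau_- / (2 e_OR) = n! / (2 * (n/2)^(q+1) * (n-q-1)!)."""
    return Fraction(factorial(n), factorial(n - q - 1)) / (2 * Fraction(n, 2) ** (q + 1))

def size_bound(n, p, q):
    """The circuit has many AND gates or many OR gates, so only the smaller bound is guaranteed."""
    return min(and_gate_bound(n, p), or_gate_bound(n, q))
```

For example, with $n = 100$ and $p = 2$ the AND-gate bound is $100 \cdot 99 \cdot 98 / 50^3 = 4851/625$, which exceeds $(1.8)^{3}$.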

LEMMA 9.6.6 Let $k \le n + 1$. Then any approximation circuit for $f^{(n)}_{\mathrm{clique},k}$ either computes a
function that is identically zero or makes errors on at least half of the $k$-negative test inputs.

Proof Let the approximation circuit for $f^{(n)}_{\mathrm{clique},k}$ compute the function $\hat{f}^{(n)}_{\mathrm{clique},k}$. If this
function is identically zero, we are done. Suppose not. Since the output gate in the original
circuit is an AND gate, the function $\hat{f}^{(n)}_{\mathrm{clique},k}$ is represented by a SOPE in which each term
is the product of variables whose endpoint set (the vertices involved) has size at most $p$.
Because $\hat{f}^{(n)}_{\mathrm{clique},k}$ is not identically zero, there is a non-zero term $t$ such that $\hat{f}^{(n)}_{\mathrm{clique},k} \ge t$.
An error is made on a negative test input if $t = 1$. This happens if each of the
endpoints in $E(t)$ is in a different set of the $(k-1)$-balanced partition defining the negative
test input.

Let $\phi$ be the fraction of the negative test inputs on which $t = 1$. We derive a lower
bound to $\phi$ by deriving an upper bound on the fraction $\chi$ of the $(k-1)$-balanced partitions
with the property that two or more vertices in $E(t)$ fall into the same set. It follows that
$\phi \ge 1 - \chi$.

To simplify bounding $\chi$, we use the one-to-one correspondence developed in the proof
of Lemma 9.6.5 between the $n$ vertices in $V = \{1, 2, 3, \ldots, n\}$ and the pairs $\mathcal{P}$ associated
with a $(k-1)$-balanced partition. Since $E(t)$ has at most $p$ vertices, the number of ways to
assign vertices to pairs in $\mathcal{P}$ so that two vertices of $E(t)$ fall into the same set, $N_2$,
is at most the number of ways to choose two vertices from a set of $p$ vertices, $p(p-1)/2$,
times the number of ways of assigning these two vertices to pairs in $\mathcal{P}$, $m_2$, and the number
of ways of assigning the remaining $n-2$ vertices, $(n-2)!$. Here $m_2$ is at most the product
of the number of ways of choosing a pair for the first vertex, $(k-1)\lceil n/(k-1) \rceil$, and the
number of ways of choosing a pair for the second from the same set, $\lceil n/(k-1) \rceil - 1$. Thus,
$N_2$ is at most $(p(p-1)/2)(k-1)\lceil n/(k-1) \rceil (\lceil n/(k-1) \rceil - 1)(n-2)!$, which is at most
$p^2 \lceil n/(k-1) \rceil (n-1)!/2$ because $(k-1)(\lceil n/(k-1) \rceil - 1) \le n - 1$. Since there are $n!$
assignments of vertices in $V$ to pairs in $\mathcal{P}$, $\chi \le p^2 \lceil n/(k-1) \rceil/(2n)$. Because
$p = \lfloor \sqrt{k-1}/2 \rfloor$, $\chi$ is at most $1/4$ since $k - 1 \le n$. ■
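The collision bound can be checked with exact arithmetic. The sketch below assumes the proof's bound has the form $p^2 \lceil n/(k-1) \rceil/(2n)$ with $p = \lfloor \sqrt{k-1}/2 \rfloor$, and compares it with a union bound built from the exact probability that two fixed vertices land in the same set when vertices are assigned to pairs uniformly at random. All function names are illustrative.

```python
from fractions import Fraction
from math import isqrt

def block_sizes(n, k):
    """Sizes of the sets in a (k-1)-balanced partition of n vertices."""
    blocks, w = k - 1, n % (k - 1)
    return [n // blocks + 1] * w + [n // blocks] * (blocks - w)

def pair_collision_probability(n, k):
    """Exact probability that two fixed vertices fall into the same set."""
    return Fraction(sum(s * (s - 1) for s in block_sizes(n, k)), n * (n - 1))

def chi_union_bound(n, k, p):
    """Union bound over the p(p-1)/2 pairs of vertices in an endpoint set of size p."""
    return Fraction(p * (p - 1), 2) * pair_collision_probability(n, k)

def chi_text_bound(n, k, p):
    """The assumed closed form p^2 * ceil(n/(k-1)) / (2n)."""
    return Fraction(p * p * -(-n // (k - 1)), 2 * n)

def p_of(k):
    """p = floor(sqrt(k-1)/2), computed with integer arithmetic."""
    return isqrt(k - 1) // 2
```

Because every negative test input corresponds to the same number of assignments, sampling assignments uniformly is the same as sampling negative test inputs uniformly, so the fraction $\chi$ is bounded by these quantities.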

We now derive upper bounds on the number of errors introduced through the approximation
of individual AND and OR gates. Since we have assumed that AND and OR gates alternate
on any path between inputs and outputs, it follows that the inputs $f_l$ and $f_r$ to an AND gate
are outputs of OR gates (and vice versa). Furthermore, by the approximation rules, if $f_l$ and $f_r$
are inputs to an AND (OR) gate, every sum (product) in their POSE (SOPE) has an endpoint
set size of at most $q$ ($p$). We now show that each replacement of a gate by its approximator
introduces a relatively small number of errors. We begin by establishing this fact for OR gates.

LEMMA 9.6.7 Let an OR gate $\vee$ and its approximation $\hat{\vee}$ each be given as inputs the functions
$f_l$ and $f_r$ whose SOPE contains product terms of endpoint size $p$ or less. Then the number of
$k$-negative test inputs for which $\vee$ and $\hat{\vee}$ produce different outputs ($\vee$ has value 0 but $\hat{\vee}$ has value
1) is at most $e_{\mathrm{OR}}$, where $w = n \bmod (k-1)$:

$$e_{\mathrm{OR}} = \frac{(n/2)^{q+1}\,(n-q-1)!}{\left(\lceil n/(k-1) \rceil !\right)^{w} \left(\lfloor n/(k-1) \rfloor !\right)^{k-1-w}\, w!\,(k-1-w)!}$$

Proof Let $f_{\mathrm{correct}} = f_l \vee f_r$ and $f_{\mathrm{approx}} = f_l \hat{\vee} f_r$. Let $t_1, \ldots, t_l$ be the product terms
in the SOPE for $f_{\mathrm{correct}}$. Since the endpoint size of all terms in the SOPE of $f_{\mathrm{correct}}$ is at
most $p$, each term is the product of at most $p(p-1)/2$ variables.

Using the association between $(k-1)$-balanced partitions and pairs of indices given
in the proof of Lemma 9.6.5, we count $N$, the number of one-to-one mappings from $V$
to $\mathcal{P}$ for which $f_{\mathrm{correct}}(x) = 0$ but $f_{\mathrm{approx}}(x) = 1$, after which we divide by $D$, the
number of mappings corresponding to a single partition of the variables, to compute $e_{\mathrm{OR}} =
N/D$. From the proof of Lemma 9.6.5 we have that $D = (\lceil n/(k-1) \rceil !)^{w} (\lfloor n/(k-1) \rfloor !)^{k-1-w}\, w!\,(k-1-w)!$.

To derive an upper bound to $N$, observe that $f_{\mathrm{approx}}(x)$ is obtained by converting the
SOPE of $f_{\mathrm{correct}}$ to a POSE and deleting all sums in this POSE whose endpoint set size
exceeds $q$. Thus, $N$ is at most the number of ways to assign vertices to pairs in $\mathcal{P}$ that
causes a deleted sum to be 0 because the new POSE may now become 1. But this can
happen only if the endpoint set size of the deleted sum is at least $q + 1$. Thus, only if at
least $q + 1$ vertices in a sum are assigned values is it possible to have $f_{\mathrm{correct}}(x) = 0$ and
$f_{\mathrm{approx}}(x) = 1$.

Below we show that each vertex can be assigned at most $n/2$ different pairs in $\mathcal{P}$. It
follows there are at most $(n/2)^{q+1}(n-q-1)!$ ways to assign pairs to $q + 1$ or more
vertices because the first $q + 1$ can be assigned in at most $(n/2)^{q+1}$ ways and the remaining
$(n-q-1)$ vertices can be assigned in at most $(n-q-1)!$ ways. This is the desired upper
bound on $N$.

We now show that every mapping from $V$ to $\mathcal{P}$ that corresponds to a negative test input
$x$ assigns each vertex to at most $n/2$ pairs in $\mathcal{P}$.

Let $t_1, \ldots, t_l$ be product terms in the SOPE of $f_{\mathrm{correct}}$. We examine these terms in
sequence. Consider a partial mapping from $V$ to $\mathcal{P}$ that assigns values to variables so that
at least one variable in each of the products $t_1, \ldots, t_{i-1}$ is 0, thereby ensuring that each
product is 0. Consider now the $i$th product, $t_i$. If the partial mapping assigns value 0 to at
least one of its variables, we move on to consider $t_{i+1}$. (It cannot set all variables in $t_i$ to 1
because we are considering mappings causing all terms to be 0.)

Suppose that the partial mapping has not assigned value 0 to any of the variables of $t_i$.
There are two cases to consider. For some variable $x_{a,b}$ of $t_i$, either a) one or b) both of the
vertices $v_a, v_b \in V$ has not been assigned a pair in $\mathcal{P}$. In the first case, assign the second
vertex to the set containing the first, thereby setting $x_{a,b} = 0$. This can be done in at most
$\lceil n/(k-1) \rceil - 1 \le n/(k-1)$ ways since the set contains at most $\lceil n/(k-1) \rceil$ elements and at
least one of them has been chosen previously, namely the first vertex. In the second case the
two vertices can be assigned to at most $(k-1)\lceil n/(k-1) \rceil(\lceil n/(k-1) \rceil - 1) \le 2n^2/(k-1)$
pairs because the first can be assigned to $(k-1)$ sets each containing at most $\lceil n/(k-1) \rceil$
elements and the second must be assigned to one of the remaining elements in that set.

The number of ways to choose variables in $t_i$ so that it has value 0 is the number of
ways to choose a variable of each kind multiplied by the number of ways to assign values to
it. Let $\alpha$ be the number of variables of $t_i$ for which one vertex has previously been assigned
a pair and let $\beta$ be the number of variables for which neither vertex has been assigned a
pair. ($\beta \le p(p-1)/2 - \alpha$ since $t_i$ has at most $p(p-1)/2$ variables.) Thus, a variable
of the first kind can be assigned in at most $\alpha n/(k-1)$ ways and the number of ways of
assigning the two vertices in variables of the second kind is at most $\beta\, 2n^2/(k-1)$. Since
each vertex associated in such pairs can be assigned in the same number of ways, $\gamma$, it follows
that $\gamma^2 \le \beta\, 2n^2/(k-1)$. Thus, $\gamma \le \sqrt{\beta\, 2n^2/(k-1)}$.

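Summarizing, the variables in $t_i$ can be assigned in at most the following number of
ways so that $t_i$ has value 0:

$$\alpha n/(k-1) + \sqrt{\beta\, 2n^2/(k-1)} \;\le\; \alpha n/(k-1) + \sqrt{(p(p-1) - 2\alpha)\, n^2/(k-1)}$$

This quantity is largest when $\alpha = 0$ and is at most $n/2$ since $p = \lfloor \sqrt{k-1}/2 \rfloor$, which is
the desired conclusion. ■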


We now derive an upper bound on the number of errors that can be made by AND gates
on $k$-positive inputs.

LEMMA 9.6.8 Let an AND gate $\wedge$ and its approximation $\hat{\wedge}$ each be given as inputs the functions
$f_l$ and $f_r$ whose POSE contains sum terms of endpoint size $q$ or less. Then the number of $k$-positive
test inputs for which $\wedge$ and $\hat{\wedge}$ produce different outputs ($\wedge$ has value 1 but $\hat{\wedge}$ has value 0) is at
most $e_{\mathrm{AND}}$:

$$e_{\mathrm{AND}} = \frac{(n/2)^{p+1}\,(n-p-1)!}{k!\,(n-k)!}$$

Proof The proof is similar to that of Lemma 9.6.7. Let $f_{\mathrm{correct}} = f_l \wedge f_r$ and $f_{\mathrm{approx}} =
f_l \hat{\wedge} f_r$. Let $c_1, \ldots, c_l$ be the sum terms in the POSE for $f_{\mathrm{correct}}$. Since by induction the
endpoint size of all terms in the POSE of $f_l$ and $f_r$ is at most $q$, each term in $f_{\mathrm{correct}}$ is the
sum of at most $q(q-1)/2$ variables.

In  this  case  we  count  the  number  of  fc-positive  test  graphs  (they  contain  one  fc-clique) 
that  cause  /correct  (x)  =  1  but  /apProx(*)  =  0.  Since  a  fc-positive  test  graph  contains  just 
those  edges  between  a  specified  set  of  fc  vertices,  we  define  each  such  graph  by  a  one-to-one 
mapping  from  the  vertices  (endpoints)  in  V  to  the  integers  ]N(n)  =  {1,2 , ,n\,  where 
we  adopt  the  rule  that  vertices  mapped  to  the  first  fc  integers  are  those  in  the  clique  associated 
with  a  particular  test  graph.  It  follows  that  each  fc-positive  test  graph  corresponds  to  exactly 
fc!(n  —  fc)!  of  these  1-1  mappings.  Then,  Cand  is  the  number  of  such  1-1  mappings  for 
which  /correct  (*)  =  1  but  /apProx (x)  =  0  divided  by  fc!(n  -  fc)!. 

We  show  that  any  mapping  that  results  in  /correct (*)  =  1  assigns  each  endpoint  to  at 
most  n/2  values  from  IN  (n).  But  /ap  prox(®)  =  0  for  positive  test  inputs  only  if  more  than 
p  endpoints  are  assigned  values,  because  /approx  is  obtained  from  /correct  by  discarding 
product  terms  in  its  SOPE  that  contain  more  than  p  endpoints.  It  follows  that  at  most 
(n/2)p+1  (n  —  p  —  1) !  of  the  positive  test  inputs  result  in  an  error  by  the  approximate  AND 
gate.  Dividing  by  k\(n  —  fc)!,  we  have  the  desired  upper  bound  on  Band- 

To complete the proof we must show that each endpoint is assigned at most $n/2$ values
from $\mathbb{N}(n)$. Consider the sum terms $c_1, \ldots, c_l$ in the POSE of $f_{\mathrm{correct}}$ in sequence and
consider a partial mapping from $V$ to $\mathbb{N}(n)$ that causes at least one variable in each of the
sums $c_1, \ldots, c_{i-1}$ to be 1, thereby ensuring that the value of each sum is 1. Now consider
the $i$th sum, $c_i$. If the partial mapping assigns value 1 to at least one variable, we move on
to $c_{i+1}$. (It cannot set all variables in $c_i$ to 0 because we are considering mappings causing
all terms to be 1.)

We now extend the mapping by considering the set $C_i$ of variables of $c_i$ that have not
been assigned a value. A given variable $x_{a,b}$ in $C_i$ has either one or no endpoints (vertices)
previously mapped to an integer in $\mathbb{N}(n)$. If one endpoint, say $a$, has been assigned an
integer, the other endpoint, $b$, can be assigned to at most one of $k-2$ integers that cause
$x_{a,b} = 1$ because endpoint $a$ was previously assigned a value in the range $\{1, 2, \ldots, k\}$
together with at least one other vertex and $b$ must be different from them. Because there are
at most $q = \lfloor n/(4k) \rfloor$ variables of the first type, there are at most $q(k-2)$ ways to assign the
one endpoint of a variable $x_{a,b}$ of the first type so that $x_{a,b} = 1$.

Consider now variables of the second type. There are at most $q(q-1)/2$ such variables
and at most $(q(q-1)/2)\,k(k-1)$ ways to make assignments to both endpoints so that
a variable has value 1. This follows because each endpoint is assigned to a distinct integer
among the first $k$ integers in $\mathbb{N}(n)$. Since each endpoint can be assigned in the same number
of ways, this number is at most $\sqrt{(q(q-1)/2)\,k(k-1)}$.

It follows that the number of ways to assign an endpoint so that the correct and approximate
functions differ is at most $q(k-2) + \sqrt{(q(q-1)/2)\,k(k-1)} \le 2qk$, which is no
more than $n/2$ since $q = \lfloor n/(4k) \rfloor$. This is the desired conclusion. ■

The  desired  result  follows  from  the  above  lemmas. 


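THEOREM 9.6.6 For $n \ge 13$ and $8 \le k \le n/2$, every monotone circuit for the clique function
$f^{(n)}_{\mathrm{clique},k} : \mathcal{B}^{n(n-1)/2} \mapsto \mathcal{B}$ has a circuit size satisfying the following lower bound:

$$C_{\Omega_{\mathrm{mon}}}\bigl(f^{(n)}_{\mathrm{clique},k}\bigr) \ge \tfrac{1}{2}\,(1.8)^{\min(\sqrt{k-1}/2,\; n/(2k))}$$

The largest value for this lower bound is $C_{\Omega_{\mathrm{mon}}}\bigl(f^{(n)}_{\mathrm{clique},k}\bigr) = 2^{\Omega(n^{1/3})}$.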

Proof From the discussion at the beginning of this section, we see that the monotone circuit
size of $f^{(n)}_{\mathrm{clique},k}$ is at least $\min(\tau_+/e_{\mathrm{AND}},\, \tau_-/(2e_{\mathrm{OR}}))$. Thus,

$$C_{\Omega_{\mathrm{mon}}}\bigl(f^{(n)}_{\mathrm{clique},k}\bigr) \ge \min\left(\frac{n!}{2(n/2)^{p+1}(n-p-1)!},\; \frac{n!}{(n/2)^{q+1}(n-q-1)!}\right) \ge \min\left(\frac{(n-p)^{p+1}}{2(n/2)^{p+1}},\; \frac{(n-q)^{q+1}}{(n/2)^{q+1}}\right)$$

Let $8 \le k \le n/2$. It follows that $p = \lfloor \sqrt{k-1}/2 \rfloor \le \sqrt{n}/(2\sqrt{2})$ and $q = \lfloor n/(2k) \rfloor \le
n/16$. Thus, $p, q \le n/10$ if $n \ge 13$. Hence both $(n-p)$ and $(n-q)$ are at least $9n/10$, and

$$C_{\Omega_{\mathrm{mon}}}\bigl(f^{(n)}_{\mathrm{clique},k}\bigr) \ge \min\left(\tfrac{1}{2}(1.8)^{p+1},\; (1.8)^{q+1}\right)$$

The desired conclusion follows from this and the observation that $p + 1 > \sqrt{k-1}/2$ and
$q + 1 > n/(2k)$. That the maximum value of $\min(\sqrt{k-1}/2,\, n/(2k))$ is $\Omega(n^{1/3})$ under
variation of $k$ is left as a problem. (See Problem 9.38.) ■
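
The growth of the exponent can be explored numerically. The following is a minimal sketch, not a proof of Problem 9.38: the function name `best_exponent` is an assumption introduced here, and the code simply searches all $k$ in the range $8 \le k \le n/2$ for the value that maximizes $\min(\sqrt{k-1}/2,\, n/(2k))$.

```python
import math

def best_exponent(n):
    """Search 8 <= k <= n/2 for the k maximizing min(sqrt(k-1)/2, n/(2k)).

    Returns the pair (best k, best value). Brute force, for illustration only.
    """
    best_k, best_value = None, -1.0
    for k in range(8, n // 2 + 1):
        value = min(math.sqrt(k - 1) / 2, n / (2 * k))
        if value > best_value:
            best_k, best_value = k, value
    return best_k, best_value

for n in (1000, 8000, 64000):
    k, value = best_exponent(n)
    # The last column stays near a constant, consistent with growth like n^(1/3).
    print(n, k, round(value, 3), round(value / n ** (1 / 3), 3))
```

In these runs the maximizing $k$ is close to $n^{2/3}$, the point at which the two arguments of the minimum balance.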

9.6.4  Slice  Functions 

Although, as shown above, some monotone functions have exponential circuit size over the
monotone basis, it is doubtful that the methods of analysis used to obtain this result can be
extended to derive such bounds over the standard basis. (See the Chapter Notes.)

This section introduces a note of optimism by showing that the monotone circuit size of
monotone slice functions can provide a strong lower bound on the circuit size of such functions
over the standard basis. In addition, there are NP-complete languages whose characteristic
functions are slice functions. Thus, if such functions can be shown to have super-polynomial
monotone circuit size, P $\ne$ NP.

Let $|x|$ denote the number of 1s in $x$. We now define the slice functions.

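DEFINITION 9.6.6 A function $s : \mathcal{B}^n \mapsto \mathcal{B}$ is a slice function if there is an integer $0 \le k \le n$
such that $s(x) = 0$ if $|x| < k$ and $s(x) = 1$ if $|x| > k$. The $k$th slice of a function
$f : \mathcal{B}^n \mapsto \mathcal{B}$, $0 \le k \le n$, is the function $f^{[k]} : \mathcal{B}^n \mapsto \mathcal{B}$ defined below.

$$f^{[k]}(x) = \begin{cases} 0 & |x| < k \\ f(x) & |x| = k \\ 1 & |x| > k \end{cases}$$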

It should be clear from this definition that slice functions are monotone. Below we show
that if a Boolean function $f$ on $n$ variables has a large circuit size, then one of its slices has a
circuit size that differs from the size of $f$ by at most a multiplicative factor that is linear in $n$.
Thus, a function $f$ has a large circuit size if and only if one of its slice functions has a large
circuit size.
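
The definition of the $k$th slice can be sketched directly in code. This is an illustrative sketch only: inputs are modeled as Python tuples of 0s and 1s, and the names `kth_slice`, `is_monotone`, and the example function `parity` are assumptions introduced here, not notation from the text.

```python
from itertools import product

def kth_slice(f, k):
    """Return the k-th slice of f: 0 below weight k, 1 above it, f at weight k."""
    def f_k(x):
        w = sum(x)              # |x|, the number of 1s in x
        if w < k:
            return 0
        if w > k:
            return 1
        return f(x)
    return f_k

def is_monotone(g, n):
    """Brute-force check that changing any input from 0 to 1 never lowers g."""
    for x in product((0, 1), repeat=n):
        for i in range(n):
            if x[i] == 0:
                y = x[:i] + (1,) + x[i + 1:]
                if g(x) > g(y):
                    return False
    return True

def parity(x):
    """A non-monotone example function."""
    return sum(x) % 2

n = 4
print(is_monotone(parity, n))                                      # False
print(all(is_monotone(kth_slice(parity, k), n) for k in range(n + 1)))  # True
```

Even though parity is not monotone, each of its slices is, because raising the weight of an input can only move it from the 0 region through the weight-$k$ layer into the 1 region.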

We  set  the  stage  with  a  lemma  that  shows  that  the  circuit  size  of  a  Boolean  function  is 
bounded  above  by  the  circuit  size  of  its  slices  plus  an  additive  term  linear  in  its  number  of 
variables. 

LEMMA 9.6.9 Let $\Omega_0$ be the standard basis and $f : \mathcal{B}^n \mapsto \mathcal{B}$. Then the following holds, where
$C_{\Omega_0}(f^{[0]}, f^{[1]}, \ldots, f^{[n]})$ is the circuit size of all the slices simultaneously:

$$C_{\Omega_0}(f) \le C_{\Omega_0}(f^{[0]}, f^{[1]}, \ldots, f^{[n]}) + O(n)$$

Proof The goal is to construct a circuit for $f$ given the input tuple $x$ and a circuit for
all the functions $f^{[0]}, f^{[1]}, \ldots, f^{[n]}$. This is easily done. We construct a circuit to count
the number of 1s among the $n$ inputs and represent the result in binary. We then supply
this number as an address to a direct storage address function (multiplexer) where the other
inputs are the values of the slice functions. If the address is $|x|$, the output of the multiplexer
is $f^{[|x|]}(x)$. Since, as shown in Lemma 2.11.1, the counting circuit can be realized with a circuit
of size linear in $n$, and, as shown in Lemma 2.5.5, the multiplexer in question can be realized
with a linear-size circuit, the result follows. ■
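
The count-then-select construction in this proof can be mimicked at the level of functions. The sketch below is an assumption-laden illustration: it models the counting circuit by `sum` and the multiplexer by list indexing, and the names `rebuild_from_slices` and `example` are hypothetical. It checks function values only and says nothing about gate counts.

```python
from itertools import product

def kth_slice(f, k):
    """k-th slice of f: 0 below weight k, 1 above it, f itself at weight k."""
    def f_k(x):
        w = sum(x)
        return 0 if w < k else 1 if w > k else f(x)
    return f_k

def rebuild_from_slices(f, n):
    """Compute f from its n + 1 slices by counting 1s and multiplexing."""
    slices = [kth_slice(f, k) for k in range(n + 1)]
    def g(x):
        slice_values = [s(x) for s in slices]  # data inputs of the multiplexer
        address = sum(x)                       # output of the counting circuit
        return slice_values[address]           # selects the slice with index |x|
    return g

def example(x):
    """Hypothetical non-monotone test function on 4 inputs."""
    return ((x[0] ^ x[1]) & (1 - x[3])) | x[2]

g = rebuild_from_slices(example, 4)
print(all(g(x) == example(x) for x in product((0, 1), repeat=4)))
```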

We  now  establish  the  connection  between  the  circuit  size  of  a  function  and  that  of  one  of 
its  slices. 

THEOREM 9.6.7 Let $\Omega_0$ be the standard basis and $f : \mathcal{B}^n \mapsto \mathcal{B}$. Then there exists $0 \le k \le n$
such that

$$\frac{C_{\Omega_0}(f)}{n+1} - O(1) \le C_{\Omega_0}(f^{[k]}) \le C_{\Omega_0}(f) + O(n)$$

Proof The first inequality follows from Lemma 9.6.9, the following inequality and the
observation that at least one term in an average is greater than or equal to the average.

$$C_{\Omega_0}(f^{[0]}, f^{[1]}, \ldots, f^{[n]}) \le \sum_{i} C_{\Omega_0}(f^{[i]})$$

The second inequality uses the fact that the $k$th slice of a function can be expressed as

$$f^{[k]}(x) = \tau^{(n)}_{k}(x)\, f(x) + \tau^{(n)}_{k+1}(x)$$

Since the threshold functions $\tau^{(n)}_{k}$ and $\tau^{(n)}_{k+1}$ can be realized by circuits of size linear in $n$
(see Theorem 2.11.1), the second inequality follows. ■
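
The expression of a slice in terms of threshold functions can be checked by brute force on small inputs. The sketch below assumes that the threshold function with parameter $t$ has value 1 exactly when the input contains at least $t$ 1s, reads the product as AND and the sum as OR, and uses hypothetical helper names throughout.

```python
from itertools import product

def threshold(t):
    """Threshold function: value 1 exactly when the input has at least t ones."""
    return lambda x: int(sum(x) >= t)

def kth_slice(f, k):
    """k-th slice of f, taken directly from the definition."""
    def f_k(x):
        w = sum(x)
        return 0 if w < k else 1 if w > k else f(x)
    return f_k

def slice_from_thresholds(f, k):
    """k-th slice written as (tau_k AND f) OR tau_(k+1)."""
    tau_k, tau_k1 = threshold(k), threshold(k + 1)
    return lambda x: (tau_k(x) & f(x)) | tau_k1(x)

def parity(x):
    return sum(x) % 2

def identity_holds(f, n):
    """Compare both forms of every slice on all 2^n inputs."""
    return all(
        kth_slice(f, k)(x) == slice_from_thresholds(f, k)(x)
        for k in range(n + 1)
        for x in product((0, 1), repeat=n)
    )

print(identity_holds(parity, 5))
```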

In Theorem 9.6.9 we show that the monotone circuit size of slice functions provides a
lower bound on their non-monotone circuit size up to a polynomial additive term. Before
establishing this result we introduce the concept of pseudo-negation. A pseudo-negation for
variable $x_i$ in a monotone Boolean function $f : \mathcal{B}^n \mapsto \mathcal{B}$ is a function $h_i$ such that replacing
each instance of $\bar{x}_i$ in a circuit for $f$ by $h_i$ does not change the value computed by the circuit.
Thus, the pseudo-negation $h_i$ acts like the real negation $\bar{x}_i$.

In Theorem 9.6.9 we also show that for $1 \le i \le n$ the punctured threshold function
$\tau^{(n)}_{k,\neg i} : \mathcal{B}^n \mapsto \mathcal{B}$, which depends on all the variables except $x_i$, is a pseudo-negation for a $k$th
slice of every monotone function. Since for a given $k$ each of these threshold functions can be
realized by a monotone circuit of size $O(n \log n)$ (see Theorem 6.8.2), they can all be realized
by a monotone circuit of size $O(n^2 \log n)$. Although this result can be used in Theorem 9.6.9,
the following stronger result is used instead.

We now describe a circuit that computes all of the above pseudo-negations efficiently. This
circuit uses the complementary number system, a system that associates with each integer $i$
in the set $\mathbb{N}(n) = \{0, 1, 2, \ldots, n-1\}$ the complementary set $\mathbb{N}(n) - \{i\}$. It makes use of
results on sorting networks found in Chapter 6.

THEOREM 9.6.8 The set $\{\tau^{(n)}_{k,\neg i} \mid 1 \le i \le n\}$ of pseudo-negations can be realized by a monotone
circuit of size $O(n \log^2 n)$.

Proof We assume that $n = 2^s$. If not, add variables with value 0 to increase the number to
the next power of 2. This does not change the value of the function on the first $n$ variables.

For this proof let the pseudo-negations $\tau^{(n)}_{k,\neg i}$ be defined for $0 \le i \le n-1$ and on the
variables whose indices are in $\mathbb{N}(n)$. (We subtract 1 from each index.) Let $D_i = \mathbb{N}(n) -
\{i\}$ denote the indices of the variables on which $\tau^{(n)}_{k,\neg i}$ depends. An efficient monotone
circuit to compute all the pseudo-negations $\{\tau^{(n)}_{k,\neg i} \mid i \in \mathbb{N}(n)\}$ is based on an efficient
decomposition of the sets $\{D_i \mid i \in \mathbb{N}(n)\}$.

For $a, b \ge 0$, let $U_{a,b}$ be defined by

$$U_{a,b} = \{a2^b + c \mid 0 \le c \le 2^b - 1\}$$

For example, $U_{3,3} = \{24, 25, 26, 27, 28, 29, 30, 31\}$, $U_{1,2} = \{4, 5, 6, 7\}$, and $U_{2,1} = \{4, 5\}$.
The set $U_{a,b}$ has size $2^b$.

For $n = 2^s$, every set $D_i = \mathbb{N}(n) - \{i\}$ can be represented as the disjoint union of the
sets $U_{a,b}$ below, where $0 \le a_{i,j} \le 2^{s-j} - 1$. (This is the complementary number system;
see Fig. 9.14.)

$$D_i = U_{a_{i,s-1},\,s-1} \cup U_{a_{i,s-2},\,s-2} \cup \cdots \cup U_{a_{i,0},\,0}$$

To see this, note that if $i$ is in the first (second) half of $\mathbb{N}(n)$, $U_{a_{i,s-1},\,s-1}$ denotes the second
(first) half; that is, $a_{i,s-1} = 1$ ($a_{i,s-1} = 0$). The next set, $U_{a_{i,s-2},\,s-2}$, is the half of the
remaining set $D_i - U_{a_{i,s-1},\,s-1}$ that does not contain $i$, etc. Thus, $D_i$ is decomposed as
the disjoint union of sets of size $2^{s-1}, 2^{s-2}, \ldots, 2^0$. For example, when $n = 16$, $D_3 =
U_{1,3} \cup U_{1,2} \cup U_{0,1} \cup U_{2,0}$. Figure 9.14 shows the values of $a_{i,s-1}, a_{i,s-2}, \ldots, a_{i,0}$ for each
$i \in \mathbb{N}(n)$ for $n = 8$.
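
The decomposition can be computed mechanically. The sketch below rests on one observation that follows from the halving argument above: at level $b$ the block of size $2^b$ containing $i$ has index $\lfloor i/2^b \rfloor$, and the block chosen for $D_i$ is its sibling, whose index differs in the lowest bit. The function names are assumptions introduced for illustration.

```python
def U(a, b):
    """The set U_{a,b} = {a * 2^b + c : 0 <= c <= 2^b - 1}."""
    return set(range(a * 2 ** b, (a + 1) * 2 ** b))

def decompose(i, s):
    """Return [(a_{i,s-1}, s-1), ..., (a_{i,0}, 0)] for D_i = N(2^s) - {i}."""
    # (i >> b) is the index of the size-2^b block containing i;
    # flipping its lowest bit gives the sibling block, which avoids i.
    return [((i >> b) ^ 1, b) for b in range(s - 1, -1, -1)]

def rebuild(i, s):
    """Union of the blocks named by decompose(i, s)."""
    result = set()
    for a, b in decompose(i, s):
        result |= U(a, b)
    return result

print(decompose(3, 4))                       # the n = 16 example for D_3
print(rebuild(3, 4) == set(range(16)) - {3})
```

For $n = 16$ and $i = 3$ this reproduces the blocks $U_{1,3}$, $U_{1,2}$, $U_{0,1}$, and $U_{2,0}$ named in the example above.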

As suggested in Fig. 9.14, the sets $\{D_i \mid i \in \mathbb{N}(n)\}$ have either $U_{0,s-1}$ or $U_{1,s-1}$ in
common. Similarly, they also have either $U_{0,s-1} \cup U_{2,s-2}$, $U_{0,s-1} \cup U_{3,s-2}$, $U_{1,s-1} \cup U_{1,s-2}$,
or $U_{1,s-1} \cup U_{0,s-2}$ in common. Continuing in this fashion, we construct the sets $\{D_i \mid i \in
\mathbb{N}(n)\}$ by successively forming the disjoint union of $2^j$ sets, $1 \le j \le s$. Assembling the
sets in this fashion is much more economical than assembling them individually.

The value of $\tau^{(n)}_{k,\neg i}$, $i \in \mathbb{N}(n)$, is the $k$th largest variable whose index is in $D_i$. From now
on we equate the variables with their indices. Sorting the sets into which $D_i$ is decomposed
simplifies the computation. But these sets are exactly the sets that are sorted by Batcher's
sorting network based on Batcher's merging algorithm. (See Theorem 6.8.3.) Since on
Boolean data a comparator consists of one OR for the max operation and one AND for the
min operation, a monotone circuit of size $O(n \log^2 n)$ exists to sort the sets $\{U_{i,j} \mid 0 \le i \le
2^{s-j} - 1,\ 0 \le j \le s-1\}$.

The functions $\tau^{(n)}_{k,\neg i}$, $0 \le i \le n-1$, can be obtained by sorting the sets $\{U_{i,j} \mid 0 \le i \le
2^{s-j} - 1,\ 0 \le j \le s-1\}$, merging them in groups to form $D_i$ for $i \in \mathbb{N}(n)$, as suggested
above, and then taking the $k$th largest element. A faster way merges the sorted versions of
the sets $U_{a_{i,s-1},\,s-1}, U_{a_{i,s-2},\,s-2}, \ldots, U_{a_{i,0},\,0}$ in the order in which $D_i$ is assembled above.
For each of these sets the sorting network presents its elements in sorted order.

Figure 9.14 The coefficients $a_{i,j}$ of $D_i = \mathbb{N}(n) - \{i\}$ in the expansion $U_{a_{i,s-1},\,s-1} \cup
U_{a_{i,s-2},\,s-2} \cup \cdots \cup U_{a_{i,0},\,0}$ for $n = 2^s = 8$ and $s = 3$.

Since only the $k$th element of $D_i$ is needed, it is not necessary to merge all the elements
in each set when two sets are merged. To see which elements need to be merged, let $D_i(j) =
U_{a_{i,s-1},\,s-1} \cup U_{a_{i,s-2},\,s-2} \cup \cdots \cup U_{a_{i,j},\,j}$. Then $D_i - D_i(j)$ is a set of size $2^j - 1$. Observe
that the $k$th element of $D_i$ can be obtained by merging elements of rank $k$ and $k-1$ of
$D_i(1)$ with the element of $U_{a_{i,0},\,0}$. (They all have value 0 or 1.) The middle element is the
$k$th element in $D_i$. To obtain elements of rank $k$ and $k-1$ of $D_i(1)$, the elements of rank
$k$, $k-1$, $k-2$ and $k-3$ of $D_i(2)$ are merged with the two elements of $U_{a_{i,1},\,1}$ and the
middle two taken. In general, to obtain the elements of rank $k, \ldots, k - 2^j + 1$ of $D_i(j)$,
the elements of rank $k, \ldots, k - 2^{j+1} + 1$ of $D_i(j+1)$ are merged with the $2^j$ elements of
$U_{a_{i,j},\,j}$ and the middle $2^j$ taken.

We now count the number of extra AND and OR gates needed to perform the merges.
There are $2^{s-j}$ sets $D_i(j)$. The $2^j$ elements needed from these sets are obtained by merging
$2^{j+1}$ elements of $D_i(j+1)$ with the $2^j$ elements of $U_{a_{i,j},\,j}$. Since these sets can be merged
in a comparator network with $O(j2^j)$ comparators (see Theorem 6.8.2), it follows that all
the sets $D_i(j)$, $0 \le i \le n-1$, can be formed with $O(jn)$ gates for $0 \le j \le s-1$.
Summing over $j$, $0 \le j \le (\log_2 n) - 1$, shows that a total of $O(n \log^2 n)$ extra gates suffice.
Since $O(n \log^2 n)$ gates are used to sort the sets $\{U_{i,j} \mid 0 \le i \le 2^{s-j} - 1,\ 0 \le j \le s-1\}$,
the desired conclusion follows. ■

We can now show that a large lower bound on the monotone circuit size of a slice function
implies a large lower bound on its non-monotone circuit size. The importance of this statement
is emphasized by the existence of NP-complete slice functions. If such a slice function can be shown
to have super-polynomial monotone circuit size, then P $\ne$ NP.

THEOREM 9.6.9 Let $f : \mathcal{B}^n \mapsto \mathcal{B}$ be a slice function. Then

$$C_{\Omega_0}(f) \le C_{\Omega_{\mathrm{mon}}}(f) \le 2 \cdot C_{\Omega_0}(f) + O(n \log^2 n)$$

Proof The first inequality holds because the standard basis $\Omega_0$ contains the monotone basis.
To establish the second inequality, we convert a circuit over $\Omega_0$ by moving all negations to
the input variables. This can be done by at most doubling the number of gates. (See
Problems 9.11 and 2.12.)

We now show that for slice functions the negation of an input variable can be replaced
by the pseudo-negation function $\tau^{(n)}_{k,\neg i}$. To see this, observe that when $|x| > k$, at least
$|x| - 1 \ge k$ of the variables of $\tau^{(n)}_{k,\neg i}$ are 1 and $\tau^{(n)}_{k,\neg i}$ has value 1. On the other hand,
when $|x| < k$, then not enough variables can be 1 for $\tau^{(n)}_{k,\neg i}$ to have value 1. Finally, when
$|x| = k$, $\tau^{(n)}_{k,\neg i} = 0$ if $x_i = 1$ because not enough of the remaining variables are 1, and
$\tau^{(n)}_{k,\neg i} = 1$ when $x_i = 0$ by a similar reasoning. Now replace $\bar{x}_i$ with $\tau^{(n)}_{k,\neg i}$. Since $f$ is a
$k$-slice, $f = 0$ when $|x| < k$, as is $\tau^{(n)}_{k,\neg i}$. If $\bar{x}_i = 1$ when $|x| < k$, replacing $\bar{x}_i$ by its
pseudo-negation means replacing $\bar{x}_i$ by 0, which can only decrease the circuit output since
it is monotone. Thus, $f$ is computed correctly in this case. The same is true if $|x| > k$,
again by monotonicity. Since $\tau^{(n)}_{k,\neg i} = \bar{x}_i$ when $|x| = k$, the circuit correctly computes $f$
for all inputs when $\bar{x}_i$ is replaced by the $i$th pseudo-negation. ■
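
The three cases in this argument can be confirmed by exhaustive search for small $n$. The sketch below assumes the punctured threshold function has value 1 exactly when at least $k$ of the variables other than $x_i$ are 1; the function names are hypothetical and indices start at 0.

```python
from itertools import product

def punctured_threshold(x, k, i):
    """1 iff at least k of the variables other than x[i] have value 1."""
    return int(sum(x) - x[i] >= k)

def behaves_as_pseudo_negation(n, k):
    """Check the three weight cases used in the proof, for every input and index."""
    for x in product((0, 1), repeat=n):
        w = sum(x)
        for i in range(n):
            h = punctured_threshold(x, k, i)
            if w > k and h != 1:
                return False          # above the slice it must be 1
            if w < k and h != 0:
                return False          # below the slice it must be 0
            if w == k and h != 1 - x[i]:
                return False          # on the slice it equals the true negation
    return True

print(all(behaves_as_pseudo_negation(6, k) for k in range(7)))
```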

AN NP-COMPLETE SLICE FUNCTION We now exhibit the language HALF-CLIQUE CENTRAL
SLICE and show it is NP-complete. The characteristic functions of this language are slice functions.
It follows from Theorem 9.6.9 that if these slice functions have exponential circuit size,
then P $\ne$ NP. We show that HALF-CLIQUE CENTRAL SLICE is NP-complete by reducing
HALF-CLIQUE (see Problem 8.25) to it.

DEFINITION 9.6.7 A central slice of a function $f : \mathcal{B}^n \mapsto \mathcal{B}$ on $n$ variables, $f^{[\lceil n/2 \rceil]}$, is the
$\lceil n/2 \rceil$th slice.


A central slice of a function $f$ on $n$ variables is the function that has value 0 if the weight
of the input tuple is less than $\lceil n/2 \rceil$, value 1 if the weight exceeds this value, and is equal to
the value of $f$ otherwise.

Given the function $f : \mathcal{B}^* \mapsto \mathcal{B}$, $f^{(n)}$ denotes the function restricted to strings of length
$n$. The family of central slice functions $\{(f^{(n)})^{[\lceil n/2 \rceil]} \mid n \ge 2\}$ identifies the language

$$L_{\mathrm{central}}(f) = \{x \in \mathcal{B}^n \mid (f^{(n)})^{[\lceil n/2 \rceil]}(x) = 1,\ n \ge 2\}.$$

The central clique function $f^{(n)}_{\mathrm{clique},\lceil n/2 \rceil}$ has value 1 if the input graph contains a clique
on $\lceil n/2 \rceil$ vertices. The central slice of the central clique function $f^{(n)}_{\mathrm{clique},\lceil n/2 \rceil}$ is called the
half-clique central slice function and denoted $f^{(n)}_{\mathrm{clique\ slice}}$. It has value 1 if the input graph
either contains a clique on $\lceil n/2 \rceil$ vertices or contains more edges than are in a clique of this
size.

The language HALF-CLIQUE is defined in Problem 8.25 as strings describing a graph and
an integer $k$ such that a graph on $n$ vertices contains an $n/2$-clique or has more than $k$ edges.
The language HALF-CLIQUE CENTRAL SLICE associated with the central slice of a central
clique function is defined below. It simplifies the following discussion to define $e(k)$ as the
number of edges between a set of $k$ vertices. Clearly, $e(k) = \binom{k}{2}$.

HALF-CLIQUE CENTRAL SLICE

Instance: The description of an undirected graph $G = (V, E)$ in which $|V|$ is even.
Answer: "Yes" if $G$ contains a clique on $|V|/2$ vertices or at least $e(|V|/2)/2$ edges.

THEOREM 9.6.10 The language HALF-CLIQUE CENTRAL SLICE is NP-complete. Furthermore,
for all $2 \le k \le n$

$$C_{\Omega_{\mathrm{mon}}}\Bigl(\bigl(f^{(n)}_{\mathrm{clique},\lceil n/2 \rceil}\bigr)^{[k]}\Bigr) \le C_{\Omega_{\mathrm{mon}}}\bigl(f^{(5n)}_{\mathrm{clique\ slice}}\bigr)$$

For $k < e(n/2)$, $\bigl(f^{(n)}_{\mathrm{clique},\lceil n/2 \rceil}\bigr)^{[k]} = \tau_{k+1}$.

Proof We show that HALF-CLIQUE CENTRAL SLICE is NP-complete by reducing HALF-CLIQUE
to it. Given a graph $G = (V, E)$ in HALF-CLIQUE that has $n$ vertices, $n$ even, we
construct a graph $G' = (V', E')$ on $5n$ vertices such that $G$ either contains an $n/2$-clique
or has more than $k$ edges if and only if $G'$ contains a (central) clique on $5n/2$ vertices or
has at least $\lceil e(5n/2)/2 \rceil$ edges. The construction, which can be done in polynomial time,
transforms a graph on $n$ vertices to one on $5n$ vertices such that the former is an instance of
HALF-CLIQUE if and only if the latter is an instance of HALF-CLIQUE CENTRAL SLICE.

Let $V = \{v_1, v_2, \ldots, v_n\}$. Construct $G'$ from $G$ by adding the $4n$ vertices $R =
\{r_1, r_2, \ldots, r_{2n}\}$ and $S = \{s_1, s_2, \ldots, s_{2n}\}$. Represent edges in $E'$ of $G'$ with the edge
variables $\{y_{i,j} \mid 1 \le i < j \le 5n\}$. Each edge between vertices of $G$ is an edge between
vertices $V$ of $G'$. Let every edge between vertices in $R$ be in $G'$ as well as all edges between
vertices in $V$ and $R$. Set the edge variables so that the edges between $r_i$ and $s_i$, $1 \le i \le 2n$,
are absent. The unassigned variables are between vertices in $S$, between vertices in $R$ and $S$,
and between vertices in $V$ and $S$, of which there are $8n^2 - 3n$. Fix these unassigned edges
so that the number of edges between vertices in $V \cup R \cup S$ is $\lceil e(5n/2)/2 \rceil - k$, $1 \le k \le n$.
There are sufficiently many unassigned edges to do this.

We now show that $G$ contains an $n/2$-clique or has more than $k$ edges if and only if
$G'$ contains a $5n/2$-clique or has more than $\lceil e(5n/2)/2 \rceil$ edges. If $G$ has an $n/2$-clique,
the edges between $V$ and $R$ combined with the edges between vertices in $R$ and those in
$G$ constitute a $5n/2$-clique since $5n/2$ vertices in $V \cup R$ are completely connected. If $G$
has more than $k$ edges, since there are exactly $\lceil e(5n/2)/2 \rceil - k$ edges between vertices in
$V \cup R \cup S$, $G'$ has at least $\lceil e(5n/2)/2 \rceil$ edges. On the other hand, if $G'$ has a $(5n/2)$-clique,
because there is at least one absent edge between each pair of vertices $(r_i, s_i)$, $1 \le i \le 2n$,
the largest clique on vertices in $R \cup S$ has size $2n$. Thus, there must be an $(n/2)$-clique
on vertices in $V$; that is, $G$ contains an $(n/2)$-clique. Similarly, since the number of edges
between vertices in $V$ and those in $R \cup S$ is exactly $\lceil e(5n/2)/2 \rceil - k$, if $G'$ contains at least
$\lceil e(5n/2)/2 \rceil$ edges, $G$ must contain at least $k$ edges.

The membership of graph $G$ in HALF-CLIQUE is determined by specializing the graph
$G'$ by mapping its edge variables to the constants 0 and 1 or to variables of $G$. Thus,
the function testing $G$'s membership is obtained through a subfunction reduction of the
function testing the membership of $G'$. (See Definition 2.4.2.) Thus, at no increase in circuit
size, for any $k$ a circuit for $\bigl(f^{(n)}_{\mathrm{clique},\lceil n/2 \rceil}\bigr)^{[k]}$ can be obtained from a circuit for $f^{(5n)}_{\mathrm{clique\ slice}}$.
Thus, the circuit size for the latter is at least as large as for the former, which gives the second
result of the theorem.

The statement that for $k < e(n/2)$, $\bigl(f^{(n)}_{\mathrm{clique},\lceil n/2 \rceil}\bigr)^{[k]} = \tau_{k+1}$ follows from the observation
that for these values of $k$ the value of the clique function on inputs of weight
$e(n/2) - 1$ or less is 0. ■


As this theorem indicates, the search for a proof that P $\ne$ NP can be limited to the study
of the monotone circuit size of the central slice of certain monotone functions. Other central
slices of NP-complete problems have been shown to be NP-complete also. (See the Chapter
Notes.)


9.7  Circuit  Depth 

Circuit  depth  and  formula  size  are  exponentially  related,  as  shown  in  Section  9.2.3.  In  this 
section  we  examine  the  depth  of  circuits  whose  operations  have  either  bounded  or  unbounded 
fan-in.  As  seen  in  Chapter  3,  circuits  of  bounded  fan-in  are  useful  in  classifying  problems  by 
their  complexity  and  in  developing  relationships  between  time  and  space  and  circuit  size  and 
depth. 

Circuits of unbounded fan-in are constructed of AND and OR gates with potentially unbounded
fan-in whose inputs are the outputs of other such gates or literals, namely, variables
and their negations. Every Boolean function can be realized by a circuit of unbounded fan-in
and bounded depth, as is seen by considering the DNF of a Boolean function: it corresponds to
a depth-2, unbounded fan-in circuit. Knowledge of the complexity of bounded-depth circuits
may shed light on the complexity of bounded-fan-in circuits.

In this section we first show that the depth of a function $f$ is equal to the communication
complexity of a related problem in a two-player game. Communication complexity is a measure
of the amount of information that must be exchanged between two players to perform a computation.
We establish such a connection for all Boolean functions over the standard basis $\Omega_0$
and monotone functions over the monotone basis $\Omega_{\mathrm{mon}}$. These connections are used to derive
lower bounds on circuit depth for monotone and non-monotone functions. After establishing
these results we examine bounded-depth circuits and demonstrate that some problems require
exponential size when realized by such circuits.

9.7.1  Communication  Complexity 

We  define  a  communication  game  between  two  players  who  have  unlimited  computing  power 
and  communicate  via  an  error-free  channel.  This  game  has  sufficient  generality  to  derive 
interesting  lower  bounds  on  circuit  depth. 

DEFINITION 9.7.1 A communication game $(U, V)$ is defined by sets $U, V \subseteq \mathcal{B}^n$, where $U \cap
V = \emptyset$. An instance of the game is defined by $u \in U$ and $v \in V$. $u$ is assigned to Player I and
$v$ is assigned to Player II. Players alternate sending binary messages to each other. We assume that
the binary messages form a prefix code (no message is a prefix for another) so that one player can
determine when the other has finished transmitting a message.

Although each player has unlimited computing power, each message it sends is a function of just
its own $n$-tuple and the messages it has received previously from the other player. The two functions
used by the players to determine the contents of their messages constitute the protocol $\Pi$ under
which the communication game is played. The protocol also determines the first player to send a
message and termination of the game. The goal of the game is to find an index $i$, $1 \le i \le n$, such
that $u_i \ne v_i$.

Let $\Pi(u, v)$ denote the number of bits exchanged under $\Pi$ on the instance $(u, v)$ of the game
$(U, V)$. The communication complexity $C(U, V)$ of the communication game $(U, V)$ is the
minimum over all protocols $\Pi$ of the maximum number of bits exchanged under $\Pi$ on any instance
of $(U, V)$; that is,

$$C(U, V) = \min_{\Pi} \max_{u \in U,\, v \in V} \Pi(u, v)$$

Note that there is always a position $i$, $1 \le i \le n$, such that $u_i \ne v_i$ since $U \cap V = \emptyset$.

The communication game models a search problem; given disjoint sets of $n$-tuples, $U$ and
$V$, the two players search for an input variable on which the two $n$-tuples differ. A related
communication game measures the exchange of information to obtain the value of a function
$f : X \times Y \mapsto Z$ on two variables in which one player has a value in $X$ and the other has
a value in $Y$. The players must acquire enough information about each other's variable to
compute the function.

Every communication problem $(U, V)$, where $U, V \subseteq \mathcal{B}^n$, can be solved with communication
complexity $C(U, V) \le n + \lceil \log_2 n \rceil$ by the following protocol:

• Player I sends $u$ to Player II.

• Player II determines a position in which $u \ne v$ and sends it to Player I using $\lceil \log_2 n \rceil$
bits.

This bound can be improved to $C(U, V) \le n + \log_2^* n$, where $\log_2^* n$ is the number of
times that $\lceil \log_2 \rceil$ must be taken to reduce $n$ to zero. (See Problem 9.39.) The log-star

function  log}  n  grows  very  slowly.  For  example,  log}  10 10  is  8;  by  contrast,  log2  10 10  = 

33,219,280,949. 

These concepts are illustrated by the parity communication problem $(U, V)$, defined
below, where $n = 2^k$:

$U = \{u \mid u \text{ has an even number of 1s}\}$

$V = \{v \mid v \text{ has an odd number of 1s}\}$

The following protocol achieves a communication complexity bound of $C(U, V) \le 2\log_2 n$
for this problem. Later we show it is best possible.

1. If $n = 1$, the players know where their tuples differ and no communication is necessary.

2. If $n > 1$, go to the next step.

3. Player I sends the parity of the first $n/2$ bits of $u$ to Player II.

4. Since $u \ne v$, with one bit Player II tells Player I of half of the variables on which $u$ and $v$
are known to differ. Play is resumed at the first step with the half of the variables on which
they are known to differ.

Let $\kappa(n)$ denote the number of bits exchanged with this protocol. Then $\kappa(1) = 0$ and
$\kappa(n) \le \kappa(n/2) + 2$, whose solution is $\kappa(n) = 2\log_2 n$. Thus, $C(U, V) \le \kappa(n) \le 2\log_2 n$.
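
The protocol above can be simulated to confirm both the index it finds and the bit count. This is a minimal sketch under stated assumptions: a single function plays both roles, the tuples have length $n = 2^k$, and the names `parity` and `parity_protocol` are introduced here for illustration.

```python
from itertools import product

def parity(bits):
    """Parity (0 or 1) of the number of 1s in a sequence of bits."""
    return sum(bits) % 2

def parity_protocol(u, v):
    """Simulate the halving protocol; return (index where u and v differ, bits sent)."""
    assert len(u) == len(v) and parity(u) != parity(v)
    lo, hi, bits = 0, len(u), 0
    # Invariant: the parities of u and v differ on positions lo .. hi - 1.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        sent_by_I = parity(u[lo:mid])                   # Player I: one bit
        left_differs = sent_by_I != parity(v[lo:mid])   # Player II: one bit in reply
        bits += 2
        if left_differs:
            hi = mid      # continue with the first half
        else:
            lo = mid      # parities agree on the left, so they differ on the right
    return lo, bits

n = 8
evens = [u for u in product((0, 1), repeat=n) if parity(u) == 0]
odds = [v for v in product((0, 1), repeat=n) if parity(v) == 1]
results = [parity_protocol(u, v) + (u, v) for u in evens for v in odds]
print(all(u[i] != v[i] and bits == 6 for i, bits, u, v in results))
```

For $n = 8$ every instance uses exactly $2\log_2 8 = 6$ bits, matching the solution of the recurrence.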

9.7.2  General  Depth  and  Communication  Complexity 

We now establish a relationship between the depth $D_{\Omega_0}(f)$ of a Boolean function
$f : \mathcal{B}^n \mapsto \mathcal{B}$ over the standard basis $\Omega_0$ and the communication complexity of a
communication game in which $U = f^{-1}(0)$ and $V = f^{-1}(1)$, where $f^{-1}(a)$ is the set of
$n$-tuples for which $f$ has value $a$. Theorem 9.7.1 asserts that $D_{\Omega_0}(f)$ and
$C(f^{-1}(0), f^{-1}(1))$ have exactly the same value. Later we establish a similar result for
monotone functions realized over the monotone basis. We divide this result into two lemmas
that are proved separately.

THEOREM 9.7.1 For every Boolean function $f : \mathcal{B}^n \mapsto \mathcal{B}$,

$D_{\Omega_0}(f) = C(f^{-1}(0), f^{-1}(1))$

The  communication  game  allows  the  two  players  to  have  unlimited  computing  power  at 
their  disposal.  Thus,  the  protocol  they  employ  can  be  an  arbitrarily  complex  function.  This 
power  reflects  the  non-uniformity  in  the  circuit  model. 

LEMMA 9.7.1 For all Boolean functions $f : \mathcal{B}^n \mapsto \mathcal{B}$ and all $U, V \subseteq \mathcal{B}^n$ such that
$U \subseteq f^{-1}(0)$ and $V \subseteq f^{-1}(1)$, the following bound holds:

$C(U, V) \le D_{\Omega_0}(f)$

Proof In this lemma we demonstrate that a protocol for the communication game
$(f^{-1}(0), f^{-1}(1))$ can be constructed from a circuit of minimal depth for the Boolean
function $f$. We assume that such a circuit has negations only on input variables. By
Problem 9.11 there is such a circuit.

Given an instance defined by $u \in f^{-1}(0)$ and $v \in f^{-1}(1)$, the players follow a path
from the circuit output to an input at which $u$ and $v$ differ. The invariant that applies at
each step is that Player I (which holds $u$) simulates an AND gate whose value on $u$ is 0
whereas Player II (which holds $v$) simulates an OR gate whose value on $v$ is 1. The bits
transmitted by one player to the other specify which input to the current gate to follow on
the way from the output vertex to an input vertex of the circuit for $f$.

The proof is by induction. The base case applies to those Boolean functions $f$ for which
$D_{\Omega_0}(f) = 0$. In this case $f$ is either $x_i$ or $\bar{x}_i$ for some $i$, where $x_i$ is an input
variable of $f$. Thus, for each instance of the problem, both players know in advance a
variable (namely, $x_i$) on which $u$ and $v$ differ. Hence, $C(U, V) = 0$ and the base case is
established.

For the induction step, either $f = f_1 \wedge f_2$ or $f = f_1 \vee f_2$. Consider the first case; the
second is treated in a similar fashion. Obviously
$D_{\Omega_0}(f) = \max(D_{\Omega_0}(f_1), D_{\Omega_0}(f_2)) + 1$. (We are considering circuits of minimal
depth.) Let $U_j = U \cap f_j^{-1}(0)$ for $j = 1, 2$. Since $(U_j, V)$ is a communication game
associated with $f_j$ ($f_j$ must have value 1 on $V$) and $D_{\Omega_0}(f_j) < D_{\Omega_0}(f)$, by induction
$C(U_j, V) \le D_{\Omega_0}(f_j)$.

Since the output gate is AND (the other case is treated similarly), both $f_1$ and $f_2$ have
value 1 on $V$, but at least one of them has value 0 on each $u \in U$. We use the following
protocol for $(U, V)$: Player I sends 0 if $u \in U_1$ (associated with the input $f_1$ to this AND
gate) and 1 if $u \in U_2$ (associated with the input $f_2$). (If the output gate is OR, we observe
that at least one of $f_1$ and $f_2$ has value 1 on each $v \in V$ and define $V_1 = V \cap f_1^{-1}(1)$ and
$V_2 = V \cap f_2^{-1}(1)$. Player II sends a bit to specify the set containing $v$.) After the first
move the players follow the protocol for the $f_j$ defined by the bit sent by Player I. Thus,
when the output gate is AND the following bound holds:

$C(U, V) \le 1 + \max_{j=1,2} C(U_j, V) \le 1 + \max(D_{\Omega_0}(f_1), D_{\Omega_0}(f_2)) = D_{\Omega_0}(f)$

The  same  bound  holds  when  the  output  gate  is  OR.  ■ 
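The walk from the output gate to an input can be written out for a circuit given as a formula. This is a minimal sketch under stated assumptions: formulas are nested tuples with hypothetical tags `"var"`, `"not"`, `"and"`, `"or"`, negations appear only on inputs, and one routine sees both $u$ and $v$, so it only counts the bits the two players would send.

```python
def evaluate(formula, x):
    """Evaluate a formula with negations only at the inputs on the 0/1 tuple x."""
    kind = formula[0]
    if kind == "var":
        return x[formula[1]]
    if kind == "not":
        return 1 - x[formula[1]]
    left, right = evaluate(formula[1], x), evaluate(formula[2], x)
    return left & right if kind == "and" else left | right


def depth(formula):
    """Number of AND/OR gates on the longest path from the output to an input."""
    if formula[0] in ("var", "not"):
        return 0
    return 1 + max(depth(formula[1]), depth(formula[2]))


def protocol_from_formula(formula, u, v):
    """Follow the formula from output to an input; return (index, bits_sent)."""
    assert evaluate(formula, u) == 0 and evaluate(formula, v) == 1
    bits_sent = 0
    while formula[0] in ("and", "or"):
        kind, f1, f2 = formula
        if kind == "and":
            # Player I (holding u) names an input of the AND gate that is 0 on u.
            formula = f1 if evaluate(f1, u) == 0 else f2
        else:
            # Player II (holding v) names an input of the OR gate that is 1 on v.
            formula = f1 if evaluate(f1, v) == 1 else f2
        bits_sent += 1
    return formula[1], bits_sent  # the literal reached is 0 on u and 1 on v


# (x0 OR x1) AND (NOT x0 OR NOT x1): the parity of two variables.
xor2 = ("and", ("or", ("var", 0), ("var", 1)), ("or", ("not", 0), ("not", 1)))
print(protocol_from_formula(xor2, (1, 1), (0, 1)))  # (0, 2)
```

The invariant of the proof is visible in the loop: the current subformula always has value 0 on $u$ and value 1 on $v$, so the number of bits sent never exceeds the depth of the formula.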

We now prove the second half of Theorem 9.7.1.

LEMMA 9.7.2 Let $U, V \subseteq \mathcal{B}^n$ be such that $U \cap V = \emptyset$. Then there exists a Boolean function
$f : \mathcal{B}^n \mapsto \mathcal{B}$ with $U \subseteq f^{-1}(0)$ and $V \subseteq f^{-1}(1)$ such that the following bound holds:

$D_{\Omega_0}(f) \le C(U, V)$

Proof In this proof we show how to define a Boolean function and a circuit for it from a
protocol for $(U, V)$. From the protocol a tree is constructed. The root is associated with the
player who sends the first bit. As in the proof of Lemma 9.7.1, Player I is associated with
AND gates and Player II with OR gates. Thus, if the protocol specifies that Player I makes
the first move, the root is labeled AND. The two possible descendants are labeled with the
player who makes the next transmission or by a variable or its negation (the answer) if this
is the last transmission under the protocol. The function associated with the protocol is the
function computed by the circuit so constructed.

We establish the result by induction. The base case applies to sets $U$ and $V$ for which
$C(U, V) = 0$. In this case, there is an index $i$ known in advance to both players on which
$u \in U$ and $v \in V$ differ. Since either $u_i = 1$ or $u_i = 0$ for all $u \in U$ ($v_i$ has the
complementary value for all $v \in V$), let $f = \bar{x}_i$ in the first case and $f = x_i$ in the second.
Thus, in the first case (the second case is treated similarly) $U \subseteq f^{-1}(0)$, $V \subseteq f^{-1}(1)$ and
$D_{\Omega_0}(f) = 0$. This establishes the base case.

For the induction step, without loss of generality, let Player I send the first bit. (The
other case is treated similarly.) For some partition of $U = U_1 \cup U_2$, $U_1 \cap U_2 = \emptyset$, Player I
sends a 0 if $u \in U_1$ and a 1 if $u \in U_2$, after which the players play with the best protocol
for each subcase. It follows that

$C(U, V) = 1 + \max_{j=1,2} C(U_j, V)$

Since $C(U_j, V) < C(U, V)$ for $j = 1, 2$, by induction there exist Boolean functions $f_1$
and $f_2$ such that $U_j \subseteq f_j^{-1}(0)$ and $V \subseteq f_j^{-1}(1)$ and $D_{\Omega_0}(f_j) \le C(U_j, V)$ for $j = 1, 2$.
Since the output vertex is assumed to be AND, $f = f_1 \wedge f_2$, $f$ has value 1 only when both
$f_1$ and $f_2$ have value 1 and has value 0 when either $f_1$ or $f_2$ has value 0. Thus, we have

$V \subseteq f_1^{-1}(1) \cap f_2^{-1}(1) = f^{-1}(1)$
$U = U_1 \cup U_2 \subseteq f_1^{-1}(0) \cup f_2^{-1}(0) = f^{-1}(0)$

from which we conclude that

$D_{\Omega_0}(f) \le 1 + \max(D_{\Omega_0}(f_1), D_{\Omega_0}(f_2)) \le 1 + \max_{j=1,2} C(U_j, V) = C(U, V)$

which is the desired result. ■
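As a concrete instance of this construction, the halving protocol for the parity communication problem can be turned into a formula: each move by Player I becomes an AND gate, each move by Player II an OR gate, and each answer a literal. The sketch below is illustrative; the tuple encoding and function names are assumptions, and the parameter `pu` records the parity of Player I's input on the current segment, which both players know from the bits already sent.

```python
from itertools import product


def formula_from_parity_protocol(lo, hi, pu=0):
    """Formula for the protocol subtree on the 0-based index range [lo, hi)."""
    if hi - lo == 1:
        # Answer leaf: u[lo] == pu and v[lo] == 1 - pu, so the literal is 0 on u, 1 on v.
        return ("var", lo) if pu == 0 else ("not", lo)
    mid = (lo + hi) // 2
    branches = []
    for a in (0, 1):  # bit sent by Player I: parity of u on [lo, mid)
        first = formula_from_parity_protocol(lo, mid, a)        # reply "first half"
        second = formula_from_parity_protocol(mid, hi, pu ^ a)  # reply "second half"
        branches.append(("or", first, second))   # Player II's move is an OR gate
    return ("and", branches[0], branches[1])     # Player I's move is an AND gate


def evaluate(formula, x):
    kind = formula[0]
    if kind == "var":
        return x[formula[1]]
    if kind == "not":
        return 1 - x[formula[1]]
    left, right = evaluate(formula[1], x), evaluate(formula[2], x)
    return left & right if kind == "and" else left | right


def depth(formula):
    if formula[0] in ("var", "not"):
        return 0
    return 1 + max(depth(formula[1]), depth(formula[2]))


n = 4
f = formula_from_parity_protocol(0, n)
# The formula is 0 on every even-parity tuple (U) and 1 on every odd-parity tuple (V).
assert all(evaluate(f, x) == sum(x) % 2 for x in product((0, 1), repeat=n))
print(depth(f))  # 4, which equals 2 * log2(n) for n = 4
```

The depth of the formula equals the number of bits the protocol exchanges, matching the bound $D_{\Omega_0}(f) \le C(U, V)$ for this protocol.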

This establishes the connection between the depth of a Boolean function $f$ over the standard
basis $\Omega_0$ and the communication complexity associated with the sets $f^{-1}(0)$ and $f^{-1}(1)$.

We now draw some conclusions from Theorem 9.7.1. From the observation made above
that $C(U, V) \le n + \log_2^* n$ for an arbitrary communication problem $(U, V)$ when
$U, V \subseteq \mathcal{B}^n$, we have that $D_{\Omega_0}(f) \le n + \log_2^* n$ for all $f : \mathcal{B}^n \mapsto \mathcal{B}$. A better upper bound
of $D_{\Omega_0}(f) \le n + 1$ is given in Theorem 2.13.1. The best upper bound of
$n - \log_2 \log_2 n + O(1)$ has been derived by Gaskov [110], matching the lower bound of
$n - \Theta(\log \log n)$ derived in Theorem 2.12.2.

The parity communication problem described above is defined in terms of the two sets
that are the inverse images of the parity function $f_\oplus^{(n)} : \mathcal{B}^n \mapsto \mathcal{B}$. As stated in Problem
9.28, this function has a formula size of at least $n^2$. Since $D_{\Omega_0}(f) \ge \log_2 L_{\Omega_0}(f)$
(Theorem 9.2.2), it follows that $D_{\Omega_0}(f_\oplus^{(n)}) \ge 2\log_2 n$, which matches the upper bound on
the communication complexity of the parity communication problem. Thus the protocol given
earlier for this problem is optimal.

We now introduce the monotone communication game and develop a relationship between
its complexity and the depth of monotone functions over a monotone basis.

9.7.3  Monotone  Depth  and  Communication  Complexity 

We specialize Theorem 9.7.1 to monotone functions by using the fact that if $f : \mathcal{B}^n \mapsto \mathcal{B}$ is
monotone and there are two $n$-tuples $u$ and $v$ such that $f(u) = 0$ and $f(v) = 1$, then there
exists an index $i$, $1 \le i \le n$, such that $u_i < v_i$, that is, $u_i = 0$ and $v_i = 1$.

The binary $n$-tuple $x$ can be defined by the set $\{i \mid x_i = 1\}$ of indices on which variables
have value 1. This is a subset of $[n] = \{1, 2, \ldots, n\}$. Let $2^{[n]}$ be the power set of $[n]$, that
is, the set of all subsets of $[n]$. A monotone minterm (monotone maxterm) is a minimal
set of indices of variables that if set to 1 (0) cause $f$ to assume value 1 (0). (The variables
of a monotone minterm are variables in a monotone prime implicant of $f$.) Let $\min(f)$
and $\max(f)$ be the set of monotone minterms and monotone maxterms of $f$, respectively.
Observe that $a \cap b \ne \emptyset$ for every $a \in \min(f)$ and $b \in \max(f)$, because if a minterm and a
maxterm have no elements in common, $f$ can be made to assume values 0 and 1 simultaneously
for some assignment to the variables of $f$, a contradiction.
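For a small monotone function the minterms and maxterms can be enumerated by brute force. The sketch below is illustrative only: the helper names are hypothetical and the three-input majority function is chosen purely as an example. It also checks the observation above in the form that each minterm shares an index with each maxterm.

```python
from itertools import combinations


def monotone_minterms(f, n):
    """Minimal index sets that force f to 1 when set to 1 (all other variables 0)."""
    found = []
    for size in range(n + 1):  # increasing size, so earlier finds are the minimal ones
        for s in map(frozenset, combinations(range(1, n + 1), size)):
            x = [1 if i in s else 0 for i in range(1, n + 1)]
            if f(x) == 1 and not any(m <= s for m in found):
                found.append(s)
    return found


def monotone_maxterms(f, n):
    """Minimal index sets that force f to 0 when set to 0 (all other variables 1)."""
    found = []
    for size in range(n + 1):
        for s in map(frozenset, combinations(range(1, n + 1), size)):
            x = [0 if i in s else 1 for i in range(1, n + 1)]
            if f(x) == 0 and not any(m <= s for m in found):
                found.append(s)
    return found


def majority3(x):
    """Monotone example: 1 when at least two of the three inputs are 1."""
    return 1 if sum(x) >= 2 else 0


mins = monotone_minterms(majority3, 3)
maxs = monotone_maxterms(majority3, 3)
# Every minterm and every maxterm have an index in common.
assert all(a & b for a in mins for b in maxs)
```

For the majority function both families turn out to be the three two-element subsets of $\{1, 2, 3\}$, and any two such subsets intersect.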

DEFINITION 9.7.2 A monotone communication game $(A, B)$ is defined by sets $A, B \subseteq 2^{[n]}$.
An instance of the game is a pair $(a, b)$ where $a \in A$ and $b \in B$; $a$ is assigned to Player I
and $b$ is assigned to Player II. Players alternate sending messages as in the communication
game, using a predetermined protocol. The goal of the problem is to find an integer
$i \in a \cap b$. The communication complexity, $C_{\mathrm{mon}}(A, B)$, is defined as the minimum over all
protocols $\Pi$ of the maximum number of bits exchanged under $\Pi$ on any instance of $(A, B)$:

$C_{\mathrm{mon}}(A, B) = \min_{\Pi} \max_{a \in A,\, b \in B} \Pi(a, b)$

We  now  establish  a  relationship  between  this  complexity  measure  and  the  circuit  depth  of 
a  Boolean  function. 

THEOREM 9.7.2 For every monotone Boolean function $f : \mathcal{B}^n \mapsto \mathcal{B}$,

$D_{\Omega_{\mathrm{mon}}}(f) = C(f^{-1}(0), f^{-1}(1)) = C_{\mathrm{mon}}(\min(f), \max(f))$

Proof We show that $D_{\Omega_{\mathrm{mon}}}(f) = C(f^{-1}(0), f^{-1}(1))$ by specializing Lemmas 9.7.1 and
9.7.2 to monotone functions. In the base case of Lemma 9.7.1, since the circuit is monotone
we always discover a coordinate $i$ such that $u_i = 0$ and $v_i = 1$ and negations are not needed.
Thus, $C(f^{-1}(0), f^{-1}(1)) \le D_{\Omega_{\mathrm{mon}}}(f)$. In Lemma 9.7.2, since the protocol provides
a coordinate $i$ such that $u_i = 0$ and $v_i = 1$, the circuit defined by it is monotone and
$D_{\Omega_{\mathrm{mon}}}(f) \le C(f^{-1}(0), f^{-1}(1))$.

We show that $C(f^{-1}(0), f^{-1}(1)) = C_{\mathrm{mon}}(\min(f), \max(f))$ in two stages. First we
show that $C_{\mathrm{mon}}(\min(f), \max(f)) \le C(f^{-1}(0), f^{-1}(1))$. This follows because, given
any $a \in \min(f)$ and $b \in \max(f)$, we extend $a$ and $b$ to binary $n$-tuples $v$ and $u$ for
which $v_r = 1$ exactly when $r \in a$ and $u_s = 0$ exactly when $s \in b$, so that $f(v) = 1$ and
$f(u) = 0$, and use the protocol for the standard communication game on the monotone
function $f$ to find an index $i$ such that $u_i = 0$ and $v_i = 1$, that is, for which
$i \in a \cap b$. Thus, the monotone communication game exchanges no more bits than the
standard game.

To show that $C(f^{-1}(0), f^{-1}(1)) \le C_{\mathrm{mon}}(\min(f), \max(f))$, consider an instance
$(u, v)$ of $(U, V)$ where $U = f^{-1}(0)$ and $V = f^{-1}(1)$. To solve the communication
problem $(U, V)$, let $a(u) \subseteq [n]$ be defined by $r \in a(u)$ if and only if $u_r = 0$ and let
$b(v) \subseteq [n]$ be defined by $s \in b(v)$ if and only if $v_s = 1$. The goal of the standard
communication game is to find an index $i$ such that $u_i \ne v_i$. Since $f(u) = 0$ and $f(v) = 1$,
it follows from the definition of minterms and maxterms that there exist $p \in \max(f)$ and
$q \in \min(f)$ such that $p \subseteq a(u)$ and $q \subseteq b(v)$. Since each player has unlimited computing
resources available, computation of $p$ and $q$ can be done with no communication cost. Now
invoke the protocol on the instance $(q, p)$ of the monotone communication game
$(\min(f), \max(f))$, with the player holding $v$ in the role of Player I. This protocol returns
an index $i \in p \cap q$, for which $u_i = 0$ and $v_i = 1$, so it is also an index on which $u$ and $v$
differ. But this is a solution to the instance $(u, v)$ of $(f^{-1}(0), f^{-1}(1))$. Thus, no more bits
are communicated to solve the standard communication game than are exchanged with the
monotone communication game when the sets $U$ and $V$ are the inverse images of a monotone
Boolean function. ■

In  the  next  section  we  use  the  above  result  to  derive  a  large  lower  bound  on  the  monotone 
depth  of  the  clique  function. 

9.7.4  The  Monotone  Depth  of  the  Clique  Function 

In this section we illustrate the use of the monotone communication game by showing that
in this game at least $\Omega(\sqrt{k})$ bits must be exchanged between two players to compute the
clique function $f^{(n)}_{\mathrm{clique},k} : \mathcal{B}^{n(n-1)/2} \mapsto \mathcal{B}$ defined in Section 9.6 when $k \le (n/2)^{2/3}$. The
inputs to $f^{(n)}_{\mathrm{clique},k}$ are variables associated with the edges of a graph on $n$ vertices. If an
edge variable $e_{i,j} = 1$, the edge between vertices $i$ and $j$ is present. Otherwise, it is absent.
By Theorem 9.7.2, a lower bound of $\Omega(\sqrt{k})$ on the number of bits that must be exchanged
between the two players to compute $f^{(n)}_{\mathrm{clique},k}$ implies that $f^{(n)}_{\mathrm{clique},k}$ has monotone depth
$\Omega(\sqrt{k})$.

THE RULES OF THE GAME Fix $n$ and $k$. The players in this communication game are each
given sets of edges of graphs on $n$ vertices. Player I is given a set of edges that contains a
$k$-clique (an input on which $f^{(n)}_{\mathrm{clique},k}$ has value 1, a positive instance) whereas Player II is
given a set of edges that does not contain a $k$-clique (an input on which it has value 0, a
negative instance). The goal of the game is to exchange the minimum number of bits for the
worst-case instances to permit the players to identify an edge variable that is 1 on a positive
instance and 0 on a negative one. This number of bits is the communication complexity of the
game.

To derive the lower bound on communication complexity, we restrict the graphs under
consideration by choosing them so that every protocol must exchange a lot of data (this cannot
make the worst cases any worse). In particular, we give Player I only $k$-cliques, the set of
graphs, CLQ, whose only edges are those between an arbitrary set of $k$ vertices. We call Player
I the clique player. Also, we give Player II a $(k-1)$-coloring drawn from the set COL of all
possible assignments of $k - 1$ colors to the $n$ vertices of a graph $G$. The interpretation of a
$(k-1)$-coloring is that two vertices can have the same color only if there is no edge between
them. Thus, any graph that has a $(k-1)$-coloring cannot contain a $k$-clique because the $k$
vertices in such a subgraph must have different colors. We call Player II the color player. The
goal now becomes for the two players to find a monochromatic edge (both endpoints have the
same color) owned by the clique player.
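That a monochromatic clique edge always exists is the pigeonhole principle: $k$ clique vertices receive only $k - 1$ colors. The sketch below finds such an edge when one routine sees both inputs; it is not a communication protocol, and the function name and data layout (a list of vertices, a dictionary from vertex to color) are assumptions made for illustration.

```python
def monochromatic_clique_edge(clique, coloring):
    """Return an edge (u, v) of the clique whose endpoints share a color.

    clique: k distinct vertices; coloring: dict mapping every vertex to one of
    k - 1 colors. By the pigeonhole principle such an edge always exists.
    """
    first_with_color = {}  # color -> first clique vertex seen with that color
    for vertex in clique:
        color = coloring[vertex]
        if color in first_with_color:
            return (first_with_color[color], vertex)
        first_with_color[color] = vertex
    raise ValueError("the coloring uses at least k colors on the clique")


# A 3-clique on vertices 1, 3, 4 of a 5-vertex graph and a 2-coloring of all vertices.
coloring = {1: 1, 2: 2, 3: 2, 4: 1, 5: 1}
print(monochromatic_clique_edge([1, 3, 4], coloring))  # (1, 4)
```

The lower bound concerns how many bits the two players must exchange to agree on such an edge when each sees only its own input.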

In  the  standard  communication  game  players  alternate  exchanging  binary  messages.  We 
simplify  our  discussion  by  assuming  that  each  player  transmits  one  bit  simultaneously  on  each 
round.  We  then  find  a  lower  bound  on  the  number  of  rounds  and  use  this  as  a  lower  bound 
on  the  number  of  bits  exchanged  between  the  two  players. 

AN ADVERSARIAL STRATEGY We describe an adversarial strategy for the selection of cliques and
colorings that ensures that many rounds are needed for the two players to arrive at a decision.
To present the strategy, we need some notation.

Let $\mathrm{CLQ}_0$ denote the set of graphs $G = (V, E)$ on $n$ vertices that contain only those edges
in a $k$-clique. It follows that $\mathrm{CLQ}_0$ contains $\binom{n}{k}$ graphs. Let $\mathrm{COL}_0$ denote the set of
$(k-1)$-colorings of graphs on $n$ vertices, that is, $\mathrm{COL}_0 = \{c \mid c : V \mapsto [k-1]\}$, where $[k-1]$
denotes the set $\{1, 2, \ldots, k-1\}$. It follows that $\mathrm{COL}_0$ contains $(k-1)^n$ $(k-1)$-colorings.

We execute a series of rounds. During each round each player provides one bit of information
to the other. This information has the effect of reducing the uncertainty of the color
player about the possible $k$-cliques held by the clique player and of reducing the uncertainty of
the clique player about the possible $(k-1)$-colorings held by the color player. The adversary
makes the uncertainty large after each round so that the number of rounds needed will be large
and a structure of the sets of cliques and colorings that can be analyzed will be maintained.
The game ends when both players have found a monochromatic edge that is in a clique.

Let $P_t \subseteq V$ and $M_t \subseteq V$ denote the vertices that after the $t$th round are present in every
$k$-clique and missing from every $k$-clique, respectively. (Let $p_t = |P_t|$ and $m_t = |M_t|$.) Since
vertices in $M_t$ are not in any cliques after the $t$th round, as we shall see, each such vertex can
be assigned the same color as a “friend” after all vertices not in $M_t$ have been colored. Also,
after the $t$th round the vertices in a $k$-clique consist of vertices in $V - M_t$ of which those in
$P_t$ are the same for all such cliques.

Let $\mathrm{CLQ}(V, P_t, M_t)$ denote the set of $k$-cliques containing $P_t$ but no vertex in $M_t$. Let
$\mathrm{COL}(V, M_t)$ denote the $(k-1)$-colorings of vertices not in $M_t$ after the $t$th round. Then
$|\mathrm{CLQ}(V, P_t, M_t)| = \binom{n - p_t - m_t}{k - p_t}$ and $|\mathrm{COL}(V, M_t)| = (k-1)^{n - m_t}$ are the maximum
numbers of $k$-cliques and $(k-1)$-colorings that are possible after the $t$th round. Let $\mathrm{CLQ}_t$
and $\mathrm{COL}_t$ denote the actual sets of cliques and colorings that are consistent with the
information exchanged between players after the $t$th round.

Given two sets $A$ and $B$, $A \subseteq B$, we introduce a measure $\mu_B(A) = |A|/|B|$ used in
deriving our lower bound. For an element $x \in A$, $\mu_B(A)$ is a rough measure of the amount
of information that can be deduced about $x$. The smaller the value of $\mu_B(A)$, the more
information we have about $x$. This measure is specialized to cliques and colorings after the
$t$th round:

$\mu_{\mathrm{CLQ}(V,P_t,M_t)}(\mathrm{CLQ}_t) = |\mathrm{CLQ}_t| / |\mathrm{CLQ}(V, P_t, M_t)|$
$\mu_{\mathrm{COL}(V,M_t)}(\mathrm{COL}_t) = |\mathrm{COL}_t| / |\mathrm{COL}(V, M_t)|$
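For a small graph these sets and measures can be enumerated exactly. The following sketch is illustrative and assumes $n = 5$, $k = 3$ and empty sets $P$ and $M$ (the situation before the first round); the variable names are hypothetical. It counts the cliques and colorings and computes the measure of the cliques through a fixed vertex.

```python
from fractions import Fraction
from itertools import combinations, product
from math import comb

n, k = 5, 3
V = range(n)
CLQ0 = [frozenset(q) for q in combinations(V, k)]  # one graph per k-subset of V
COL0 = list(product(range(1, k), repeat=n))        # all maps from V to {1, ..., k-1}


def mu(A, B):
    """Measure of A relative to a superset B: |A| / |B|."""
    return Fraction(len(A), len(B))


assert len(CLQ0) == comb(n, k)      # binomial(n, k) cliques
assert len(COL0) == (k - 1) ** n    # (k-1)^n colorings

# Cliques containing a given vertex v, measured against all of CLQ0.
through = {v: [q for q in CLQ0 if v in q] for v in V}
average = sum(mu(through[v], CLQ0) for v in V) / n
assert average == Fraction(k, n)    # each vertex lies in a k/n fraction of the cliques
```

Fixing one more vertex of the clique therefore shrinks the set of candidate cliques by a factor of about $k/n$ on average when nothing else is known.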

Since the color player does not know the identity of vertices in $P_t$ until after the $t$th
round, its information about the clique held by the other player is measured by $p_t$ and
$\mu_{\mathrm{CLQ}(V,P_t,M_t)}(\mathrm{CLQ}_t)$. Since the clique player only knows the color of vertices $M_t$ that
are missing in all cliques after the $t$th round, its information about a $(k-1)$-coloring by the
color player is measured by $m_t$ and $\mu_{\mathrm{COL}(V,M_t)}(\mathrm{COL}_t)$.

The number of rounds, $T$, is large if for $t = T$ there is no edge present in all remaining
cliques $\mathrm{CLQ}_t$ that is monochromatic in all remaining colorings $\mathrm{COL}_t$. We show that an
adversary can choose the sets $\mathrm{CLQ}_t$ and $\mathrm{COL}_t$ at each round so that many rounds are needed.

SELECTION OF THE SETS $\mathrm{CLQ}_t$ AND $\mathrm{COL}_t$ BY THE ADVERSARY: Let the values of the bits sent by
the clique and color players be $b_{\mathrm{CLQ}}$ and $b_{\mathrm{COL}}$, respectively. At the $t$th round the following
algorithm is used to choose $\mathrm{CLQ}_t$ and $\mathrm{COL}_t$:

1) Let $P = P_{t-1}$, $p = p_{t-1}$, $M = M_{t-1}$ and $m = m_{t-1}$. Let $\mathrm{CLQ}'$ be the larger of the
two subsets of $\mathrm{CLQ}_{t-1}$ consistent with the values $b_{\mathrm{CLQ}} = 0$ and $b_{\mathrm{CLQ}} = 1$. Thus,

$\mu_{\mathrm{CLQ}(V,P,M)}(\mathrm{CLQ}') \ge \mu_{\mathrm{CLQ}(V,P,M)}(\mathrm{CLQ}_{t-1})/2$.

2) Let CLQ be a collection of $k$-cliques. Then the set of cliques $q$ in CLQ containing the
vertex $v$ is denoted $\mathrm{CLQ}(v) = \{q \in \mathrm{CLQ} \mid v \in q\}$.

Let $\mathrm{CLQ} = \mathrm{CLQ}'$. As long as there exists $v \in V - P - M$ such that the following is
true:

$\mu_{\mathrm{CLQ}(V,P,M)}(\mathrm{CLQ}(v)) \ge \frac{2(k-p-m)}{n-p-m}\, \mu_{\mathrm{CLQ}(V,P,M)}(\mathrm{CLQ})$   (9.2)

replace $P$ by $P^* = P \cup \{v\}$, $p$ by $p^* = p + 1$, and CLQ by $\mathrm{CLQ}^* = \mathrm{CLQ}(v)$. Here
$(k-p-m)\, \mu_{\mathrm{CLQ}(V,P,M)}(\mathrm{CLQ})/(n-p-m)$ is the average of $\mu_{\mathrm{CLQ}(V,P,M)}(\mathrm{CLQ}(v))$
over all $v \in V - P - M$. Thus, $\mathrm{CLQ}(v)$ has measure at least twice the average.

Since $|\mathrm{CLQ}(V, P^*, M)| = (k-p-m)\,|\mathrm{CLQ}(V, P, M)|/(n-p-m)$ after each iteration
of this loop, the following bound holds:

$\mu_{\mathrm{CLQ}(V,P^*,M)}(\mathrm{CLQ}^*) \ge 2\mu_{\mathrm{CLQ}(V,P,M)}(\mathrm{CLQ})$

That is, the renormalized measure of the set of cliques after one iteration of the loop is at
least double that of the measure before the iteration.

After exiting from this loop let $\mathrm{CLQ}'_t = \mathrm{CLQ}^*$ and let $P_t = P$. Since $P_t$ contains
$p_t - p_{t-1}$ more items than $P_{t-1}$, the following inequality holds:

$\mu_{\mathrm{CLQ}(V,P_t,M_{t-1})}(\mathrm{CLQ}'_t) \ge 2^{p_t - p_{t-1}} \mu_{\mathrm{CLQ}(V,P_{t-1},M_{t-1})}(\mathrm{CLQ}')$

$\ge 2^{p_t - p_{t-1}} \mu_{\mathrm{CLQ}(V,P_{t-1},M_{t-1})}(\mathrm{CLQ}_{t-1})/2$   (9.3)

Furthermore, for any vertex $v$ remaining in $V - P$ the condition expressed in (9.2) is
violated, so that the following holds for $v \in V - P$, where
$\alpha = 2(k - p_t - m_{t-1})/(n - p_t - m_{t-1})$:

$\mu_{\mathrm{CLQ}(V,P_t,M_{t-1})}(\{q \in \mathrm{CLQ}'_t \mid v \in q\}) < \alpha\, \mu_{\mathrm{CLQ}(V,P_t,M_{t-1})}(\mathrm{CLQ}'_t)$   (9.4)

3) Let $\mathrm{COL}'_t = \{c \in \mathrm{COL}_{t-1} \mid c \text{ is 1-1 on } P_t\}$. That is, $\mathrm{COL}'_t$ is the set of
$(k-1)$-colorings in $\mathrm{COL}_{t-1}$ that assign unique colors to vertices in $P_t$. By restricting the
$(k-1)$-colorings we do not increase the number of rounds. In Lemma 9.7.3 we develop a lower
bound on $\mu_{\mathrm{COL}(V,M_{t-1})}(\mathrm{COL}'_t)$ in terms of $\mu_{\mathrm{COL}(V,M_{t-1})}(\mathrm{COL}_{t-1})$.

4) Let $M = M_{t-1}$ and $m = m_{t-1}$. Let $\mathrm{COL}^0$ and $\mathrm{COL}^1$ denote the subsets of $\mathrm{COL}'_t$
consistent with the values $b_{\mathrm{COL}} = 0$ and $b_{\mathrm{COL}} = 1$, respectively. Let COL be the larger
of these two sets. Then $\mu_{\mathrm{COL}(V,M)}(\mathrm{COL}) \ge \mu_{\mathrm{COL}(V,M)}(\mathrm{COL}'_t)/2$.

5) The set $\mathrm{COL}_t(u, v) = \{c \in \mathrm{COL} \mid c(u) = c(v)\}$ contains those $(k-1)$-colorings in
COL for which vertices $u$ and $v$ have the same color.

As long as there exist $u, v \in V - M$ such that the following is true:

$\mu_{\mathrm{COL}(V,M)}(\mathrm{COL}_t(u, v)) \ge 2\mu_{\mathrm{COL}(V,M)}(\mathrm{COL})/(k-1)$

let $w$ be one of $u$ and $v$ that is not in $P$ (they cannot both be in $P$ and have the same color
because each coloring is 1-1 on $P$); replace $M$ by $M^* = M \cup \{w\}$, $m$ by $m^* = m + 1$,
and COL by $\mathrm{COL}^* = \mathrm{COL}_t(u, v)$.

The term $\mu_{\mathrm{COL}(V,M)}(\mathrm{COL})/(k-1)$ is the average of $\mu_{\mathrm{COL}(V,M)}(\mathrm{COL}_t(u, v))$ over all
$u$ and $v$ in $V - M$. Thus, $\mathrm{COL}^*$ contains $(k-1)$-colorings whose measure is at least
twice the average.

Since $|\mathrm{COL}(V, M^*)| = |\mathrm{COL}(V, M)|/(k-1)$ after each iteration of this loop, the
following holds:

$\mu_{\mathrm{COL}(V,M^*)}(\mathrm{COL}^*) \ge 2\mu_{\mathrm{COL}(V,M)}(\mathrm{COL})$

That is, the renormalized measure of the set of $(k-1)$-colorings after each loop iteration
is at least double that of the measure before the iteration.

After exiting from this loop, let $M_t = M$. Since $M_t$ contains $m_t - m_{t-1}$ more items than
$M_{t-1}$, the following inequality holds:

$\mu_{\mathrm{COL}(V,M_t)}(\mathrm{COL}^*) \ge 2^{m_t - m_{t-1}} \mu_{\mathrm{COL}(V,M_{t-1})}(\mathrm{COL})$

$\ge 2^{m_t - m_{t-1}} \mu_{\mathrm{COL}(V,M_{t-1})}(\mathrm{COL}'_t)/2$   (9.5)

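6) Let $\mathrm{COL}_t = \mathrm{COL}^*$, $M_t = M$, and $\mathrm{CLQ}_t = \{q \in \mathrm{CLQ}'_t \mid M_t \cap q = \emptyset\}$. Thus, $\mathrm{CLQ}_t$
does not contain any cliques with vertices in $M_t$. In Lemma 9.7.4 we develop a lower
bound on $\mu_{\mathrm{CLQ}(V,P_t,M_{t-1})}(\mathrm{CLQ}_t)$ in terms of $\mu_{\mathrm{CLQ}(V,P_t,M_{t-1})}(\mathrm{CLQ}'_t)$.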


PERFORMANCE  OF  THE  ADVERSARIAL  STRATEGY  We  establish  three  lemmas  and  then  derive 
the  lower  bound  on  the  number  of  rounds  of  the  communication  game. 


LEMMA 9.7.3 After step 3 of the adversarial selection the following inequality holds:

$\mu_{\mathrm{COL}(V,M_{t-1})}(\mathrm{COL}'_t) \ge \left(1 - \frac{(p_t + 1)^2}{k - 1}\right) \mu_{\mathrm{COL}(V,M_{t-1})}(\mathrm{COL}_{t-1})$

Proof Recall the definition of $\mathrm{COL}_t(u, v) = \{c \in \mathrm{COL} \mid c(u) = c(v)\}$. Consider the
results of step 3 of the $t$th round in the adversary selection process. Because of the choices
made in step 5 in the $(t-1)$st round and the choice of $\mathrm{COL}_0$, the following inequality
holds for all $t > 0$ and $u, v \in V - M_{t-1}$ when $u \ne v$:

$\mu_{\mathrm{COL}(V,M_{t-1})}(\mathrm{COL}_t(u, v)) < 2\mu_{\mathrm{COL}(V,M_{t-1})}(\mathrm{COL}_{t-1})/(k-1)$

Because $M_t = M_{t-1}$ at step 3 of the $t$th round and $P_t \subseteq V - M_t$, the same bound applies
for $u$ and $v$ in $P_t$.

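The set $\mathrm{COL}_{t-1}$ is reduced to $\mathrm{COL}'_t = \{c \in \mathrm{COL}_{t-1} \mid c \text{ is 1-1 on } P_t\}$ by discarding
$(k-1)$-colorings for which $u$ and $v$ are in $P_t$ and have the same color. From the above
facts the following inequalities hold (here instances of the measure $\mu$ carry the subscript
$\mathrm{COL}(V, M_{t-1})$):

$$
\begin{aligned}
\mu(\mathrm{COL}'_t) &= \mu(\{c \in \mathrm{COL}_{t-1} \mid c \text{ is 1-1 on } P_t\}) \\
&= \mu(\mathrm{COL}_{t-1}) - \mu\Big(\bigcup_{u,v \in P_t,\ u \ne v} \mathrm{COL}_t(u, v)\Big) \\
&\ge \mu(\mathrm{COL}_{t-1}) - \sum_{u,v \in P_t,\ u \ne v} \mu(\mathrm{COL}_t(u, v)) \\
&\ge \left(1 - \binom{p_t}{2} \frac{2}{k-1}\right) \mu(\mathrm{COL}_{t-1}) \\
&\ge \left(1 - \frac{(p_t + 1)^2}{k-1}\right) \mu(\mathrm{COL}_{t-1})
\end{aligned}
$$

From this the conclusion follows. ■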


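LEMMA 9.7.4 After step 6 of the adversarial selection the following inequality holds:

$\mu_{\mathrm{CLQ}(V,P_t,M_{t-1})}(\mathrm{CLQ}_t) \ge \left(1 - \frac{2km_t}{n}\right) \mu_{\mathrm{CLQ}(V,P_t,M_{t-1})}(\mathrm{CLQ}'_t)$

Proof As stated in (9.4), after step 2 of the $t$th round of the adversary selection process we
have for all $v \in V - P_t - M_{t-1}$ the following inequality:

$\mu_{\mathrm{CLQ}(V,P_t,M_{t-1})}(\{q \in \mathrm{CLQ}'_t \mid v \in q\}) < \frac{2(k - p_t - m_{t-1})}{n - p_t - m_{t-1}}\, \mu_{\mathrm{CLQ}(V,P_t,M_{t-1})}(\mathrm{CLQ}'_t)$

Since $M_t \subseteq V - P_t$, this bound applies to $v \in M_t$. In the rest of this proof all instances of
$\mu$ carry the subscript $\mathrm{CLQ}(V, P_t, M_{t-1})$.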

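Since $\mathrm{CLQ}_t = \{q \in \mathrm{CLQ}'_t \mid M_t \cap q = \emptyset\}$, after step 6 the following inequalities hold:

$$
\begin{aligned}
\mu(\mathrm{CLQ}_t) &= \mu(\{q \in \mathrm{CLQ}'_t \mid M_t \cap q = \emptyset\}) \\
&= \mu(\mathrm{CLQ}'_t) - \mu\Big(\bigcup_{v \in M_t} \{q \in \mathrm{CLQ}'_t \mid v \in q\}\Big) \\
&\ge \left(1 - \frac{2(k - p_t - m_{t-1})\, m_t}{n - p_t - m_{t-1}}\right) \mu(\mathrm{CLQ}'_t) \\
&\ge \left(1 - \frac{2km_t}{n}\right) \mu(\mathrm{CLQ}'_t)
\end{aligned}
$$

From this the conclusion follows. ■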


The  third  lemma  sets  the  stage  for  the  principal  result  of  this  section. 

LEMMA 9.7.5 Let $k > 1$ and $t \le \sqrt{k}/4$ and $t \le n/(8k)$. Then the following inequalities hold:

$\mu_{\mathrm{CLQ}(V,P_t,M_{t-1})}(\mathrm{CLQ}_t) \ge 2^{p_t - 2t}$
$\mu_{\mathrm{COL}(V,M_t)}(\mathrm{COL}_t) \ge 2^{m_t - 2t}$

Proof The inequalities hold for $t = 0$ because
$\mu_{\mathrm{CLQ}(V,P_0,M_0)}(\mathrm{CLQ}_0) = \mu_{\mathrm{COL}(V,M_0)}(\mathrm{COL}_0) = 1$. We assume as inductive hypothesis that the
inequalities hold for the first $t - 1$ rounds and show they hold for the $t$th round as well.

Using the inductive hypothesis and (9.3), we have

$\mu_{\mathrm{CLQ}(V,P_t,M_{t-1})}(\mathrm{CLQ}'_t) \ge 2^{p_t - p_{t-1}} \mu_{\mathrm{CLQ}(V,P_{t-1},M_{t-1})}(\mathrm{CLQ}_{t-1})/2 \ge 2^{p_t - 2t + 1}$   (9.6)

Since $\mu_{\mathrm{CLQ}(V,P_t,M_{t-1})}(\mathrm{CLQ}'_t) \le 1$, we conclude that $p_t \le 2t - 1$. Using this result, the
assumption that $t \le \sqrt{k}/4$, Lemma 9.7.3, and the inductive hypothesis, we have

$$
\begin{aligned}
\mu_{\mathrm{COL}(V,M_{t-1})}(\mathrm{COL}'_t) &\ge \left(1 - \frac{4t^2}{k-1}\right) \mu_{\mathrm{COL}(V,M_{t-1})}(\mathrm{COL}_{t-1}) \\
&\ge \tfrac{1}{2}\, \mu_{\mathrm{COL}(V,M_{t-1})}(\mathrm{COL}_{t-1}) \\
&\ge 2^{m_{t-1} - 2t + 1}
\end{aligned}
$$

Combining this and (9.5) (note that in step 6 we let $\mathrm{COL}_t = \mathrm{COL}^*$), we have the bound on
colorings among the two desired conclusions, namely $\mu_{\mathrm{COL}(V,M_t)}(\mathrm{COL}_t) \ge 2^{m_t - 2t}$. This
implies that $m_t \le 2t$. Applying this to the inequality in Lemma 9.7.4 and using the condition
$t \le n/(8k)$, we get the following inequality:

$\mu_{\mathrm{CLQ}(V,P_t,M_{t-1})}(\mathrm{CLQ}_t) \ge \mu_{\mathrm{CLQ}(V,P_t,M_{t-1})}(\mathrm{CLQ}'_t)/2$

Combining this with the lower bound given in (9.6), we have the bound on cliques, namely,
$\mu_{\mathrm{CLQ}(V,P_t,M_{t-1})}(\mathrm{CLQ}_t) \ge 2^{p_t - 2t}$. ■
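The exponent bookkeeping in this proof can be checked line by line. The worked steps below are a reader's sketch of the arithmetic, not an additional result; they use only the inductive hypothesis, (9.3), (9.5), Lemmas 9.7.3 and 9.7.4, and the two conditions on $t$, and they assume the factor in Lemma 9.7.4 is $1 - 2km_t/n$.

```latex
\begin{align*}
2^{p_t - p_{t-1}} \cdot 2^{p_{t-1} - 2(t-1)} \cdot \tfrac12 &= 2^{p_t - 2t + 1}
  && \text{(9.3) with the inductive hypothesis} \\
2^{p_t - 2t + 1} \le 1 \;&\Longrightarrow\; p_t \le 2t - 1
  && \text{(a measure is at most 1)} \\
1 - \frac{(p_t + 1)^2}{k - 1} \;\ge\; 1 - \frac{4t^2}{k - 1} \;&\ge\; 1 - \frac{k/4}{k - 1} \;\ge\; \frac12
  && (t \le \sqrt{k}/4,\ k \ge 2) \\
2^{m_t - m_{t-1}} \cdot \tfrac12 \cdot 2^{m_{t-1} - 2t + 1} &= 2^{m_t - 2t}
  && \text{(9.5) with the bound on } \mathrm{COL}'_t \\
1 - \frac{2km_t}{n} \;\ge\; 1 - \frac{4kt}{n} \;&\ge\; \frac12
  && (m_t \le 2t,\ t \le n/(8k)) \\
\tfrac12 \cdot 2^{p_t - 2t + 1} &= 2^{p_t - 2t}
  && \text{(Lemma 9.7.4 with (9.6))}
\end{align*}
```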

We  now  state  the  principal  conclusion  of  this  section. 

THEOREM 9.7.3 Let $2 \le k \le (n/2)^{2/3}$. Then the monotone communication complexity of the
$k$-clique function $f^{(n)}_{\mathrm{clique},k}$ is $\Omega(\sqrt{k})$.

Proof Run the adversarial selection process for $T = \sqrt{k}/4$ steps to produce sets $\mathrm{CLQ}_T$,
$\mathrm{COL}_T$, $P_T$, and $M_T$. Below we show that $\mathrm{CLQ}_T$ and $\mathrm{COL}_T$ are not empty. Give the
clique player a $k$-clique $q \in \mathrm{CLQ}_T$ and the color player a $(k-1)$-coloring $c \in \mathrm{COL}_T$. To
show that the two players cannot agree in $T$ or fewer rounds on an edge in a clique in $\mathrm{CLQ}_T$
that is monochromatic in all $c \in \mathrm{COL}_T$, assume they can, and let $(u, v) \in q$ be that edge.
It follows that both $u$ and $v$ are in $M_T$. But this cannot happen because, by construction,
$q \cap M_T = \emptyset$.

To show that $\mathrm{CLQ}_T$ and $\mathrm{COL}_T$ are not empty, observe that $k \le (n/2)^{2/3}$ and
$t \le \sqrt{k}/4$ imply that $t \le n/(8k)$. Thus, Lemma 9.7.5 can be invoked, which implies that
$p_t, m_t \le 2t \le \sqrt{k}/2 \le k/2 < n$. Invoking the definitions, the following inequalities also
hold.

$|\mathrm{CLQ}_t| \ge 2^{p_t - 2t}\, |\mathrm{CLQ}(V, P_t, M_{t-1})| > 0$
$|\mathrm{COL}_t| \ge 2^{m_t - 2t}\, |\mathrm{COL}(V, M_t)| > 0$

Since  the  right-hand  sides  are  non-zero,  we  have  the  desired  conclusion.  ■ 
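The implication used in this proof, that $k \le (n/2)^{2/3}$ and $t \le \sqrt{k}/4$ together give $t \le n/(8k)$, is a one-line calculation, written out here for convenience:

```latex
k \le (n/2)^{2/3} \;\Longleftrightarrow\; k^{3/2} \le \frac{n}{2}
\quad\Longrightarrow\quad
t \;\le\; \frac{\sqrt{k}}{4} \;=\; \frac{k^{3/2}}{4k} \;\le\; \frac{n/2}{4k} \;=\; \frac{n}{8k}.
```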

9.7.5  Bounded-Depth  Circuits 

As  explained  earlier,  bounded-depth  circuits  are  studied  to  help  us  understand  the  depth  of 
bounded  fan-in  circuits.  Bounded-depth  circuits  for  arbitrary  Boolean  functions  require  that 
the  fan-in  of  some  gates  be  unbounded  because  otherwise  only  a  bounded  number  of  inputs 
can  influence  the  output(s). 

In  Section  2.3  we  encountered  the  DNF,  CNF,  SOPE,  POSE,  and  RSE  normal  forms. 
Each  of  these  corresponds  to  a  circuit  of  bounded  depth.  The  DNF  and  SOPE  normal  forms 
represent  Boolean  functions  as  the  OR  of  the  AND  of  literals.  The  OR  and  each  of  the  ANDs 
is  a  function  of  a  potentially  unbounded  number  of  literals.  The  same  statement  applies  to 
the  CNF  and  POSE  normal  forms  when  AND  and  OR  are  exchanged.  The  RSE  normal  form 
represents  Boolean  functions  as  the  EXCLUSIVE  OR  of  the  AND  of  variables,  that  is,  without 
the  use  of  negation.  Again,  the  fan-in  of  the  two  types  of  operation  is  potentially  unbounded. 
As stated in Problems 2.8 and 2.9, the SOPE and POSE of the parity function $f_\oplus^{(n)}$ have
exponential size, as does the RSE of the OR function $f_\vee^{(n)}$. In Problem 2.10 it is stated that
the  function  /(©d  ,  has  exponential  size  in  the  DNF,  CNF,  and  RSE  normal  forms. 
In this section we show that every bounded-depth circuit for the parity function $f_\oplus^{(n)}$ over
the basis containing the NOT gate on one input and the AND and OR gates on an arbitrary
number of inputs has exponential size. Thus, the depth-2 result extends to arbitrary depth.

BOUNDED-DEPTH PARITY CIRCUITS HAVE EXPONENTIAL SIZE We use an approximation method
to derive a lower bound on the size of a bounded-depth circuit for $f_\oplus^{(n)}$. This method parallels
almost exactly the method of Section 9.6.3. Starting with gates most distant from the output
and progressing toward it, replace each gate of a given circuit by an approximating circuit.
We show that as each replacement is made, the number of new errors it introduces is small.
However, we also show that after all gates are approximated, the number of errors between the
approximating circuit and $f_\oplus^{(n)}$ is large. This implies that the number of gates replaced is large.

The approximation method used here replaces each gate in a circuit by a polynomial over
GF(3), the three-element field containing $\{-1, 0, 1\}$, with the property that if the variables
of such a polynomial assume values in $\mathcal{B} = \{0, 1\}$, the value of the polynomial is in $\mathcal{B}$. For
example, the polynomial $x_1(1 - x_2)x_3$ has value 1 over $\mathcal{B}$ only when $x_1 = x_3 = 1$ and
$x_2 = 0$ and has value 0 otherwise. Thus, it corresponds exactly to the minterm $x_1\bar{x}_2x_3$. Since
every minterm can be represented as a polynomial of this kind, every Boolean function $f$ can
be realized by a polynomial over GF(3) by forming the sum of one such polynomial for each
of its minterms. A $b$-approximator is a polynomial of degree $b$ that approximates a Boolean
function.
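The correspondence between minterms and polynomials over GF(3) can be checked mechanically. The Python sketch below is illustrative only: the helper names `mod3`, `minterm_poly`, and `function_poly` are hypothetical and do not come from the text. It builds the polynomial of a minterm as a product of factors $x_i$ or $(1 - x_i)$ and sums the minterm polynomials of a Boolean function, reducing every value to $\{-1, 0, 1\}$.

```python
from itertools import product

def mod3(v):
    """Reduce an integer to the GF(3) representative in {-1, 0, 1}."""
    r = v % 3
    return r - 3 if r == 2 else r

def minterm_poly(pattern):
    """Polynomial that is 1 exactly on the 0/1 tuple `pattern` (hypothetical helper)."""
    def p(x):
        value = 1
        for bit, xi in zip(pattern, x):
            # factor x_i for an unnegated literal, (1 - x_i) for a negated one
            value *= xi if bit == 1 else (1 - xi)
        return mod3(value)
    return p

def function_poly(f, n):
    """Sum over GF(3) of the minterm polynomials of the Boolean function f."""
    terms = [minterm_poly(a) for a in product((0, 1), repeat=n) if f(a) == 1]
    return lambda x: mod3(sum(t(x) for t in terms))

# x1 (1 - x2) x3 corresponds to the minterm that is 1 only at (1, 0, 1).
p = minterm_poly((1, 0, 1))

def parity(a):
    return sum(a) % 2

parity_poly = function_poly(parity, 3)
```

On Boolean inputs exactly one minterm polynomial is nonzero whenever the function has value 1, which is why the sum stays in $\{0, 1\}$.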

Although we establish the lower bound for the basis containing NOT and the unbounded
fan-in AND and OR gates, the result continues to hold if the unbounded fan-in MOD3 function
is added to the basis. (See Problem 9.41.) We begin by showing that the function computed
by a circuit $C$ containing $\mathrm{size}(C)$ gates cannot differ from its $b$-approximator on too many
input tuples.

LEMMA 9.7.6 Let $f : \mathcal{B}^n \mapsto \mathcal{B}$ be computed by a circuit $C$ of depth $d$. There is a
$(2k)^d$-approximator circuit $\hat{C}$ computing $\hat{f} : \mathcal{B}^n \mapsto \mathcal{B}$ such that $f$ and $\hat{f}$ differ on at most
$\mathrm{size}(C)\,2^{n-k}$ input n-tuples, where $n$ is the number of inputs on which $C$ depends and
$\mathrm{size}(C)$ is the number of gates that it contains.

Proof We construct a $b$-approximator for $C$, $b = (2k)^d$, by first approximating inputs ($x_i$ and
$\bar{x}_i$ are approximated exactly on $\mathcal{B}$ by $x_i$ and $(1 - x_i)$), after which we approximate gates all
of whose inputs have been approximated until the output gate has been approximated. We
establish the result of the lemma by induction.

We treat the statement of the lemma as our inductive hypothesis and show that if it holds
for $d = D - 1$, it holds for $d = D$. The hypothesis holds on inputs, namely, when $d = 0$.
Suppose the hypothesis holds for $d = D - 1$. Since $C$ has depth $D$, each of the inputs to the
output gate has depth at most $D - 1$ and satisfies the hypothesis. The output gate is AND,
OR, or NOT. Suppose it is NOT. Let $g$ be the function associated with its input. We replace
the NOT gate with the function $(1 - g)$, which introduces no new errors. Since $g$ and $1 - g$
have the same degree, the inductive hypothesis holds in this case.

If the output gate is the AND of $g_1, g_2, \ldots, g_m$, it can be represented exactly by the
function $g_1 g_2 \cdots g_m$. However, this polynomial has degree $m(2k)^{d-1}$ if each of its inputs
has degree at most $(2k)^{d-1}$; this violates the inductive hypothesis if $m > 2k$, which may
happen because the fan-in of the gate is potentially unbounded. Thus we must introduce
some error in order to reduce the degree of the approximating polynomial. Since the OR of
$g_1, g_2, \ldots, g_m$ can be represented by $1 - (1 - g_1)(1 - g_2) \cdots (1 - g_m)$ using DeMorgan’s
Rules, both the AND and the OR of $g_1, g_2, \ldots, g_m$ have the same degree. We find an approximating
polynomial for both AND and OR by approximating the OR gate.

We approximate the OR of $g_1, g_2, \ldots, g_m$ by creating subsets $S_1, S_2, \ldots, S_k$ of
$\{g_1, g_2, \ldots, g_m\}$, computing $f_i = \bigl(\sum_{g_j \in S_i} g_j\bigr)^2$, and combining these results in

$$\mathrm{OR}(f_1, f_2, \ldots, f_k) = 1 - (1 - f_1)(1 - f_2) \cdots (1 - f_k)$$

The degree of this approximation is $2k$ times the maximal degree of any polynomial in the
set $\{g_1, g_2, \ldots, g_m\}$ or at most $(2k)^d$, the desired result.

There is no error in this approximation if the original OR has value 0. We now show
that there exist subsets $S_1, S_2, \ldots, S_k$ such that the number of errors is at most $2^{n-k}$ when the
original OR has value 1. Fix a particular input n-tuple $x$ to the circuit. Suppose each subset
is formed by deciding for each function in $\{g_1, g_2, \ldots, g_m\}$ with probability 1/2 whether
or not to include it in the set. If one or more of $\{g_1, g_2, \ldots, g_m\}$ is 1 on $x$, the probability
that a given $f_i$ has value 1 on $x$ is at least 1/2. Thus, the probability that
$\mathrm{OR}(f_1, f_2, \ldots, f_k)$ has value 0 when the original OR has value 1 is the probability that each
of $f_1, f_2, \ldots, f_k$ has value 0, which is at most $2^{-k}$. Since the sets $\{S_1, S_2, \ldots, S_k\}$ result
in an error on input $x$ with probability at most $2^{-k}$, the average number of errors on input
$x$, averaged over all choices for the $k$ sets, is at most $2^{-k}$ and the average number of errors
on the set of $2^n$ inputs is at most $2^{n-k}$. It follows that some set $\{S_1, S_2, \ldots, S_k\}$ (and
a corresponding approximating function) has an incorrect value on at most $2^{n-k}$ inputs.
Since by the inductive hypothesis at most $(\mathrm{size}(C) - 1)2^{n-k}$ errors occur on all but the
output gate, at most $\mathrm{size}(C)\,2^{n-k}$ errors occur on the entire circuit. ■
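The probabilistic step in this proof can be checked exhaustively for small cases. The sketch below is an illustration under stated assumptions, not part of the proof: it assumes the inputs $g_j$ to the OR gate take values in $\{0, 1\}$ on the fixed input $x$, and the function names `approx_or` and `error_probability` are hypothetical. It enumerates every choice of $k$ subsets and returns the exact probability that the squared-sum approximation over GF(3) disagrees with the true OR.

```python
from fractions import Fraction
from itertools import product

def approx_or(g_values, subsets):
    """Evaluate 1 - prod(1 - f_i) over GF(3), where f_i = (sum of g_j in S_i)^2."""
    prod_term = 1
    for S in subsets:
        f_i = (sum(g_values[j] for j in S) ** 2) % 3   # 0 or 1 for 0/1-valued g_j
        prod_term = (prod_term * (1 - f_i)) % 3
    return (1 - prod_term) % 3

def error_probability(g_values, k):
    """Exact probability, over k independent uniform subsets, of a wrong value."""
    m = len(g_values)
    true_or = int(any(g_values))
    all_subsets = [tuple(j for j in range(m) if mask[j])
                   for mask in product((0, 1), repeat=m)]
    wrong = sum(approx_or(g_values, choice) != true_or
                for choice in product(all_subsets, repeat=k))
    return Fraction(wrong, len(all_subsets) ** k)
```

For example, `error_probability((0, 0, 0), 2)` is 0, matching the claim that no error arises when the OR has value 0, and for every nonzero 3-tuple the value is at most $2^{-2}$.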

The next result demonstrates that a $\sqrt{n}$-approximator (obtained by letting $k = n^{1/(2d)}/2$)
and the parity function must differ on many inputs. This is used to show that the circuit being
approximated must have many gates.

LEMMA 9.7.7 Let $\hat{f} : \mathcal{B}^n \mapsto \mathcal{B}$ be a $\sqrt{n}$-approximator for $f_\oplus^{(n)}$. Then, $\hat{f}$ and $f_\oplus^{(n)}$ differ on at
least $2^n/50$ input n-tuples.

Proof Let $U \subseteq \mathcal{B}^n$ be the n-tuples on which the functions agree. We derive an upper
bound on $|U|$ of $\beta = (49)2^n/50$ that implies the lower bound of the lemma. We derive
this bound indirectly. Since there are $3^{|U|}$ functions $g : U \mapsto \{-1, 0, 1\}$, we assign each one
a different polynomial and show that the number of such polynomials is at most $3^\beta$, which
implies that $|U| \le \beta$.

Transform the polynomial in the variables $x_1, x_2, \ldots, x_n$ representing $f_\oplus^{(n)}$ by mapping
$x_i$ to $y_i = 2x_i - 1$. This mapping sends 1 to 1 and 0 to $-1$. (Observe that $y_i^2 = 1$.) It
does not change the degree of a polynomial. In these new variables $f_\oplus^{(n)}$ can be represented
exactly by the polynomial $y_1 y_2 \cdots y_n$.

Given a function $g : U \mapsto \{-1, 0, 1\}$, extend it arbitrarily to a function $g : \mathcal{B}^n \mapsto
\{-1, 0, 1\}$. Let $p$ be a polynomial in $Y = \{y_1, y_2, \ldots, y_n\}$ that represents $g$ on $U$ exactly.
Let $c\,y_{i_1} y_{i_2} \cdots y_{i_t}$ be a term in $p$ for some constant $c \in \{-1, 1\}$. We show that if $t$ is larger
than $n/2$ we can replace this term with a smaller-degree term.

Let $T = \{y_{i_1}, y_{i_2}, \ldots, y_{i_t}\}$ and $\bar{T} = Y - T$. The term $c\,y_{i_1} y_{i_2} \cdots y_{i_t}$ can be written
as $c\,\Pi T$, where by $\Pi T$ we mean the product of all terms in $T$. With $y_i^2 = 1$, this may
be rewritten as $c\,\Pi Y\,\Pi \bar{T}$. Since $\hat{f} = f_\oplus^{(n)} = \Pi Y$ on the set $U$, this is equivalent to $c\,\hat{f}\,\Pi \bar{T}$,
which has degree $\sqrt{n} + n - |T|$. Thus, a term $c\,y_{i_1} y_{i_2} \cdots y_{i_t}$ of degree $t > n/2$ can
be replaced by terms of degree at most $\sqrt{n} + n - t < \sqrt{n} + n/2$. It follows that every function
$g : U \mapsto \{-1, 0, 1\}$ can be represented on $U$ by a polynomial of degree at most $\sqrt{n} + n/2$,
so the number of such functions is at most the number of such polynomials. Since there are
$\binom{n}{j}$ ways to choose a term containing $j$ variables of $Y$, the number $N$ of terms available
to such polynomials satisfies the following bound:

$$N = \sum_{j=0}^{\sqrt{n}+(n/2)} \binom{n}{j}$$

For sufficiently large $n$, the bound on $N$ is approximately $0.9772 \cdot 2^n < (49/50)2^n$. (See
Problem 9.7.) Since each of the $N$ terms can be included in a polynomial with coefficient
$-1$, 0, or 1, there are at most $3^N$ distinct polynomials and corresponding functions
$g : U \mapsto \{-1, 0, 1\}$, which is the desired conclusion. ■
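The constant 0.9772 can be sanity-checked numerically with exact integer arithmetic. The sketch below is illustrative; the function name `low_degree_term_fraction` is hypothetical, and it assumes $\sqrt{n}$ and $n/2$ are rounded down to integers, which the text does not specify.

```python
from math import isqrt

def low_degree_term_fraction(n):
    """Return N / 2**n, where N = sum_{j=0}^{n/2 + sqrt(n)} C(n, j) (floors assumed)."""
    limit = n // 2 + isqrt(n)
    term, total = 1, 0                      # term holds C(n, j), starting at C(n, 0)
    for j in range(limit + 1):
        total += term
        term = term * (n - j) // (j + 1)    # C(n, j+1) computed exactly from C(n, j)
    return total / 2**n                     # true division of exact big integers

ratio = low_degree_term_fraction(10_000)
```

For $n = 10\,000$ the ratio lies a little above 0.9772 and below $49/50$. For small $n$ (for example $n = 100$) it can exceed $49/50$, which is why the lemma is stated for sufficiently large $n$.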

We  summarize  these  two  results  in  Theorem  9.7.4. 

THEOREM 9.7.4 Every circuit of depth $d$ for the parity function $f_\oplus^{(n)}$ has a size exceeding
$2^{n^{1/(2d)}/2}/50$ for sufficiently large $n$.

Proof Let $U$ be the set of n-tuples on which $f_\oplus^{(n)}$ and its approximation $\hat{f}$ differ. From
Lemma 9.7.6, $|U|$ is at most $\mathrm{size}(C)\,2^{n-k}$. Now let $k = n^{1/(2d)}/2$. From Lemma 9.7.7 these
two functions must differ on at least $2^n/50$ input n-tuples. Thus, $\mathrm{size}(C)\,2^{n-k} \ge 2^n/50$, from
which the conclusion follows. ■
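The role of the choice of $k$ can be unpacked in a few lines. The derivation below only substitutes $k = n^{1/(2d)}/2$ into the two lemmas; it is a restatement for the reader's convenience rather than an additional result.

```latex
\begin{align*}
(2k)^d &= \bigl(n^{1/(2d)}\bigr)^d = \sqrt{n}
  && \text{so Lemma 9.7.6 yields a } \sqrt{n}\text{-approximator } \hat{f},\\
\bigl|\{x : \hat{f}(x) \ne f_\oplus^{(n)}(x)\}\bigr| &\le \mathrm{size}(C)\,2^{\,n-k}
  && \text{(Lemma 9.7.6)},\\
\bigl|\{x : \hat{f}(x) \ne f_\oplus^{(n)}(x)\}\bigr| &\ge 2^n/50
  && \text{(Lemma 9.7.7)},\\
\mathrm{size}(C) &\ge \frac{2^k}{50} = \frac{2^{\,n^{1/(2d)}/2}}{50}.
\end{align*}
```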


Problems 

MATHEMATICAL  PRELIMINARIES 

9.1 Show that the following identity holds for integers $r$ and $L$:

$$\left\lceil \frac{L}{r+1} \right\rceil + \left\lfloor \frac{rL}{r+1} \right\rfloor = L$$

9.2 Show that a rooted tree of maximal fan-in $r$ containing $k$ internal vertices has at most
$k(r - 1) + 1$ leaves and that a rooted tree with $l$ leaves and fan-in $r$ has at most $l - 1$
vertices with fan-in 2 or more and at most $2(l - 1)$ edges.

9.3 For positive integers $n_1$, $n_2$, $a_1$, and $a_2$, show that the following inequality holds:

$$\frac{n_1^2}{a_1} + \frac{n_2^2}{a_2} \ge \frac{(n_1 + n_2)^2}{a_1 + a_2}$$

9.4 The external path length $e(T, L)$ of a binary tree $T$ with $L$ leaves is the sum of the
lengths of the paths from the root to the leaves. Show that
$e(T, L) \ge L\lceil \log_2 L \rceil - 2^{\lceil \log_2 L \rceil} + L$.
Hint: Argue that the external path length is minimal for a nearly balanced binary tree.
Use this fact and a proof by induction to obtain the external path length of a binary
tree with $L = 2^k$ for some integer $k$. Use this result to establish the above statement.
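Before attempting a proof, the bound of Problem 9.4 can be compared numerically against the true minimum for small $L$. The sketch below is a check, not a proof, and rests on an assumption: every internal vertex of the binary tree has exactly two children. The names `min_external_path_length` and `bound` are hypothetical.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def min_external_path_length(L):
    """Minimum sum of root-to-leaf path lengths over binary trees with L leaves."""
    if L == 1:
        return 0
    # Split the L leaves between the two subtrees of the root; each leaf is one
    # edge deeper than in its own subtree, which contributes the extra L.
    return L + min(min_external_path_length(a) + min_external_path_length(L - a)
                   for a in range(1, L // 2 + 1))

def bound(L):
    """L * ceil(log2 L) - 2**ceil(log2 L) + L, using exact integer arithmetic."""
    c = (L - 1).bit_length()        # equals ceil(log2 L) for L >= 1
    return L * c - 2**c + L
```

Under this assumption the two functions agree for every $L$ tried, which suggests the bound is met with equality by nearly balanced trees.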

9.5 For positive integers $r$ and $s$, show that $\lceil s/r \rceil (s \bmod r) + \lfloor s/r \rfloor (r - s \bmod r) = s$.
Hint: Use the fact that for any real number $a$, $\lceil a \rceil - \lfloor a \rfloor = 1$ if $a$ is not an integer and
0 otherwise. Also use the fact that $s \bmod r = s - \lfloor s/r \rfloor \cdot r$.

9.6 (Binomial Theorem) Show that the coefficient of the term $x^i y^{n-i}$ in the expansion of
the polynomial $(x + y)^n$ is the binomial coefficient $\binom{n}{i}$. That is,

$$(x + y)^n = \sum_{i=0}^{n} \binom{n}{i} x^i y^{n-i}$$

9.7 Show that the following sum is closely approximated by $0.4772 \cdot 2^n$ for large $n$:

$$\sum_{i=n/2}^{(n/2)+\sqrt{n}} \binom{n}{i}$$

Hint: Use the fact that $n!$ can be very closely approximated by $\sqrt{2\pi n}\, n^n e^{-n}$ to approximate
$\binom{n}{i}$. Then approximate a sum by an integral (see Problem 2.23) and consult
tables of values for the error function $\mathrm{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x e^{-t^2}\, dt$.

9.8  Let  0  <  x  <  y.  Show  that  x  +  sjy  —  x  >  ^Jy . 

CIRCUIT  MODELS  AND  MEASURES 

9.9  Provide  an  algorithm  that  produces  a  formula  for  each  circuit  of  fan-out  1  over  a  basis 
that  has  fan-in  of  at  most  2. 

9.10 Show that any monotone Boolean function $f : \mathcal{B}^n \mapsto \mathcal{B}$ can be expanded on its
first variable as

$$f(x_1, x_2, \ldots, x_n) = f(0, x_2, \ldots, x_n) \lor (x_1 \land f(1, x_2, \ldots, x_n))$$

9.11  Show  that  a  circuit  for  a  Boolean  function  (one  output  vertex)  over  the  standard  basis 
can  be  transformed  into  one  that  uses  negation  only  on  inputs  by  at  most  doubling  the 
number  of  AND,  OR,  and  NOT  gates  and  without  changing  its  depth  by  more  than  a 
constant  factor. 

Hint:  Find  the  two-input  gate  closest  to  the  output  gate  that  is  connected  to  a  NOT 
gate.  Change  the  circuit  to  move  the  NOT  gate  closer  to  the  inputs. 

RELATIONSHIPS  AMONG  COMPLEXITY  MEASURES 

9.12 Using the construction employed in Theorem 9.2.1, show that the depth of a function
$f : \mathcal{B}^n \mapsto \mathcal{B}^m$ in a circuit of fan-out $s$ over a complete basis $\Omega$ of fan-in $r$ satisfies the
inequality

$$D_{s,\Omega}(f) \le D_\Omega(f)\left(1 + \log_s\left(r\,C_\Omega(f)/D_\Omega(f)\right)\right)$$

9.13 Show that there are ten functions $f$ with $L_\Omega(f) = 2$ that are dependent on two
variables and that each can be realized from a circuit for $f_{\mathrm{mux}}$ plus at most one instance
of NOT on an input to $f_{\mathrm{mux}}$ and on its output.

9.14  Extend  the  upper  bound  on  depth  versus  formula  size  of  Theorem  9.2.2  to  monotone 
functions. 


LOWER-BOUND  METHODS  FOR  GENERAL  CIRCUITS 

9.15 Show that the function $f(x_1, x_2, \ldots, x_n) = x_1 \land x_2 \land \cdots \land x_n$ has circuit size
$\lceil (n - 1)/(r - 1) \rceil$ and depth $\lceil \log_r n \rceil$ over the basis containing the r-input AND gate.

9.16 The parity function $f_\oplus^{(n)} : \mathcal{B}^n \mapsto \mathcal{B}$ has value 1 when an odd number of its variables
have value 1 and 0 otherwise. Derive matching upper and lower bounds on the size
and depth of the smallest and shallowest circuit(s) for $f_\oplus^{(n)}$ over the basis $B_2$.

9.17 Show that the function $f_{\mathrm{mod}\,4}$ defined to have value 1 if the sum of the $n$ inputs
modulo 4 is 1 can be realized by a circuit over the basis $B_2$ whose size is $2.5n + O(1)$.
Hint: Show that the function is symmetric and devise a circuit to compute the sum of
three bits as the sum of two bits.

9.18  Over  the  basis  B2  derive  good  upper  and  lower  bounds  on  the  circuit  size  of  the  func¬ 
tions  f^n>  :  Bn  1— >  B  and  :  Bn  1— >  B  defined  as 

/ ^  =  (( y  +  2)  mod  4)  mod  2 
/j")  =  (( y  +  2)  mod  5)  mod  2 

Here $y = \sum_i x_i$, and $\sum$ and $+$ denote integer addition.

9.19 Show that the set of Boolean functions on two variables that depend on both variables
contains only AND-type and parity-type functions. Here an AND-type function computes
$(x^a \land y^b)^c$ for Boolean constants $a$, $b$, $c$ whereas a parity-type function computes
$x \oplus y \oplus c$ for some Boolean constant $c$.

9.20 The threshold function $\tau_t^{(n)} : \mathcal{B}^n \mapsto \mathcal{B}$ on $n$ inputs has value 1 if $t$ or more inputs
are 1 and 0 otherwise. Show that over the basis $B_2$, $C_{B_2}(\tau_2^{(n)}) \ge 2n - 4$.

9.21 A formula for the parity function $f_{\oplus,c}^{(n)} : \mathcal{B}^n \mapsto \mathcal{B}$ on $n$ inputs is given below. Show
that it has circuit size exactly $3(n - 1)$ over the standard basis when NOT gates are not
counted:

$$f_{\oplus,c}^{(n)} = x_1 \oplus x_2 \oplus \cdots \oplus x_n \oplus c$$

9.22 Show that $f_{\oplus,c}^{(n)}$ has circuit size exactly $4(n - 1)$ over the standard basis when NOT gates
are counted.

9.23 Show that $f_{\oplus,c}^{(n)}$ has circuit size exactly $7(n - 1)$ over the basis $\{\land, \neg\}$.

LOWER BOUNDS TO FORMULA SIZE

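9.24 Show that the multiplexer function $f_{\mathrm{mux}}^{(p)}$ can be realized by a formula of size $3 \cdot 2^p - 2$
in which the total number of address variables is $2(2^p - 1)$.

Hint: Expand the function $f_{\mathrm{mux}}^{(p)}$ as suggested below, where $a^{(k)}$ denotes the $k$ components
of $a$ with smallest index and $P = 2^p$:

$$f_{\mathrm{mux}}^{(p)}(a^{(p)}, y_{P-1}, \ldots, y_0) = f_{\mathrm{mux}}^{(1)}\bigl(a_{p-1},\ f_{\mathrm{mux}}^{(p-1)}(a^{(p-1)}, y_{P-1}, \ldots, y_{P/2}),\ f_{\mathrm{mux}}^{(p-1)}(a^{(p-1)}, y_{P/2-1}, \ldots, y_0)\bigr)$$

Also, represent $f_{\mathrm{mux}}^{(1)}$ as shown below.

$$f_{\mathrm{mux}}^{(1)}(a, y_1, y_0) = (\bar{a} \land y_0) \lor (a \land y_1)$$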
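The recursive expansion in the hint to Problem 9.24 can be explored with a small experiment. The Python sketch below is illustrative and makes two assumptions: formula size is measured by the number of leaves, and the recursion bottoms out at a single data variable, so that one level gives $(\bar{a} \land y_0) \lor (a \land y_1)$. The names `mux`, `leaves`, and `evaluate` are hypothetical.

```python
from itertools import product

def mux(p, lo=0):
    """Formula tree for the multiplexer on data y_lo .. y_(lo + 2^p - 1), addresses a_0 .. a_(p-1)."""
    if p == 0:
        return ("y", lo)
    a = ("a", p - 1)
    half = 2 ** (p - 1)
    low = mux(p - 1, lo)             # selected when a_(p-1) = 0
    high = mux(p - 1, lo + half)     # selected when a_(p-1) = 1
    return ("or", ("and", ("not", a), low), ("and", a, high))

def leaves(t):
    """List the variable occurrences (leaves) of a formula tree."""
    if t[0] in ("a", "y"):
        return [t]
    return [leaf for child in t[1:] for leaf in leaves(child)]

def evaluate(t, a, y):
    """Evaluate a formula tree on address bits a and data bits y."""
    op = t[0]
    if op == "a":
        return a[t[1]]
    if op == "y":
        return y[t[1]]
    if op == "not":
        return 1 - evaluate(t[1], a, y)
    vals = [evaluate(c, a, y) for c in t[1:]]
    return min(vals) if op == "and" else max(vals)
```

Counting leaves of `mux(p)` for small `p` gives $2^p$ data leaves and $2(2^p - 1)$ address leaves, matching the totals stated in the problem; the proof that this holds for all $p$ is left to the reader, as the problem intends.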

9.25 Show that Neciporuk’s method cannot provide a lower bound larger than $O(n^2/\log n)$
for a function on $n$ variables.

9.26 Derive a quadratic upper bound on the formula size of the parity function $f_\oplus^{(n)}$ over
the standard basis.

9.27  Neciporuk’s  function  is  defined  in  terms  of  an  \n/m  \  x  m  matrix  of  Boolean  variables, 
X  =  {xij},  m  =  |"log2  n\  +  2,  and  a  matrix  S  =  {er,©  of  the  same  dimen¬ 
sions  in  which  each  entry  <Jij  is  a  distinct  m- tuple  over  B  containing  at  least  two  Is. 
Neciporuk’s  function,  N( X),  is  defined  as 

n(x) = ®  xi,j  a  ©  n  Xk’1 

i.j  ,k  =  ]  i  such  that 

o-jjt  |)=1 

Here $\oplus$ denotes the exclusive or operation. Show that this function has formula size
$\Omega(n^2/\log n)$ over the basis $B_2$.

9.28 Use Krapchenko’s method to derive a lower bound of $n^2$ on the formula size of the
parity function $f_\oplus^{(n)} : \mathcal{B}^n \mapsto \mathcal{B}$.

9.29 Use Krapchenko’s method to derive a lower bound of $\Omega(t(n - t + 1))$ on the formula
size over the standard basis of the threshold function $\tau_t^{(n)}$, $1 \le t \le n - 1$.

9.30 Generalize Krapchenko’s lower-bound method as follows. Let $f : \mathcal{B}^n \mapsto \mathcal{B}$ and let
$A \subseteq f^{-1}(0)$ and $B \subseteq f^{-1}(1)$. Let $Q = [q_{i,j}]$ be defined by $q_{i,j} = 1$ if $x_i \in A$ and
$x_j \in B$ are neighbors and $q_{i,j} = 0$ otherwise. Let $P = QQ^T$ and $\bar{P} = Q^TQ$. Then
$p_{r,s}$ is the number of common neighbors to $x_r$ and $x_s$ in $B$. The matrices $P$ and
$\bar{P}$ are symmetric and their largest eigenvalues, $\lambda(P)$ and $\lambda(\bar{P})$, are both non-negative
and $\lambda(P) = \lambda(\bar{P})$. Show that

$$L_\Omega(f) \ge \lambda(P)$$


9.31 Under the conditions of Problem 9.30, let

$$D(f) = \frac{1}{|A|} \sum_{r,s} p_{r,s}, \qquad \bar{D}(f) = \frac{1}{|B|} \sum_{r,s} \bar{p}_{r,s}, \qquad K(f) = \frac{|N(A,B)|^2}{|A||B|}$$

where $K(f)$ is the lower bound given in Theorem 9.4.2. Show that

$$K(f) \le D(f) \le \lambda(P)$$

$$K(f) \le \bar{D}(f) \le \lambda(\bar{P})$$

Hint: Use the fact that the largest eigenvalue of a matrix $P$ satisfies

$$\lambda(P) = \max_{x \ne 0} \frac{x^T P x}{x^T x}$$

Also, let $s_i$ be the sum of the elements in the $i$th column of the matrix $Q$. Show that
$\sum_i s_i^2 = \sum_{r,s} p_{r,s}$.


LOWER-BOUND  METHODS  FOR  MONOTONE  CIRCUITS 

9.32 Consider a monotone circuit on $n$ inputs that computes a monotone Boolean function
$f : \mathcal{B}^n \mapsto \mathcal{B}$. Let the circuit have $k$ two-input AND gates, one of them the output gate,
and let these gates compute the Boolean functions $g_1, g_2, \ldots, g_k = f$, where the AND
gates are inverse-ordered by their distance from the output gate computing $f$. Since the
function $g_j$ is computed using the values of $x_1, x_2, \ldots, x_n, g_1, \ldots, g_{j-1}$, show that $g_j$
can be computed using at most $n + j - 2$ two-input OR gates and one AND gate. Show
that this implies the following upper bound on the monotone circuit size of $f$:

Cnmon(/)  <kn+(^  2  -  1 

Let $C_\land(f)$ denote the minimum number of AND gates used to realize $f$ over the monotone
basis. This result implies the following relationship:

$$C_{\Omega_{mon}}(f) = O\left((C_\land(f))^2\right)$$

How  does  this  result  change  if  the  gate  associated  with  /  is  an  OR  gate? 

9.33  Show  that  the  prime  implicants  of  a  monotone  function  are  monotone  prime  impli- 
cants. 

9.34 Find the monotone implicants of the Boolean threshold function $\tau_t^{(n)} : \mathcal{B}^n \mapsto \mathcal{B}$,
$1 \le t \le n$.

9.35 Using the gate-elimination method, show that $C_{\Omega_{mon}}(\tau_2^{(n)}) \ge 2n - 3$.

9.36  Show  that  an  expansion  of  the  form  of  equation  (9.1)  on  page  420  holds  for  every 
monotone  function. 

9.37  Show  that  the  f^que  k  :  g  can  be  realized  by  a  monotone  circuit  of  size 

0[nn). 

9.38 Show that the largest value assumed by $\min(\sqrt{k - 1}/2,\ n/(2k))$ under variation of $k$
is

CIRCUIT DEPTH

9.39 Show that the communication complexity of a problem $(U, V)$, $U, V \subseteq \mathcal{B}^n$, satisfies
$C(U, V) \le n + \log_2^* n$, where $\log_2^* n$ is the number of times that $\lceil \log_2 \rceil$ must be taken
to reduce $n$ to zero.

Hint: Complete the definition of a protocol in which Player I sends Player II
$n - \lceil \log_2 n \rceil$ bits on the first round and Player II responds with a message specifying
whether or not its n-tuple agrees with that of Player I and if not, where they differ.

9.40  Consider  the  communication  problem  defined  by  the  following  sets: 

$U = \{u \mid 3$ divides the number of 1s in $u\}$

$V = \{v \mid 3$ does not divide the number of 1s in $v\}$

Show that a protocol exists that solves this problem with communication complexity
$3\lceil \log_2 n \rceil$.

9.41 Show that Theorem 9.7.4 continues to hold when the MOD3 function is added to the
basis where MOD3 is the Boolean function that has value 1 when the number of 1s
among its inputs is not divisible by 3.

Chapter  Notes 

The dependence of circuit size on fan-out stated in Theorem 9.2.1 is due to Johnson et al.
[150]. The depth bound implied by this result is proportional to the product of the depth and
the logarithm of the size of the original circuit. Hoover et al. [138] have improved the depth
bound so that it is proportional to $(\log_r s)D_\Omega(f)$ without sacrificing the size bound of [150].

The  relationship  between  formula  size  and  depth  in  Theorem  9.2.2  is  due  to  Spira  [314], 
whose  depth  bound  has  a  coefficient  of  proportionality  of  2.465  over  the  basis  of  all  Boolean 
functions  on  two  variables.  Over  the  basis  of  all  Boolean  functions  except  for  parity  and  its 
complement,  Preparata  and  Muller  [259]  obtain  a  coefficient  of  1.81.  Brent,  in  a  paper  on  the 
parallelization  of  arithmetic  formulas  [58],  has  effectively  extended  the  relationship  between 
depth  and  formula  size  to  monotone  functions.  (See  also  [359].) 

An  interesting  relationship  between  complexity  measures  that  is  omitted  from  Section  9.2, 
due  to  Paterson  and  Valiant  [240] ,  shows  that  circuit  size  and  depth  satisfy  the  inequality 

dm  >  \cm*kCM-o{cm) 

The lower bounds of Theorem 9.3.2 on functions in $Q_{2,3}^{(n)}$ are due to Schnorr [300],
whereas that of Theorem 9.3.3 on the multiplexer function is due to Paul [244]. Blum [48],
building on the work of Schnorr [302], has obtained a lower bound of $3(n - 1)$ for a particular
function of $n$ variables over the basis $B_2$. This is the best circuit-size lower bound for this
basis. Zwick [374] has obtained a lower bound of $4n$ for certain symmetric functions over the
basis $U_2$. Red’kin [274] has obtained lower bounds with coefficients as high as 7 for certain
functions over the bases $\{\land, \neg\}$ and $\{\lor, \neg\}$. (See Problem 9.23.) Red’kin [276] has used the
gate-elimination method to show that the size of the ripple-adder circuit of Section 2.7 cannot
be improved.

The coefficient of Neciporuk’s lower-bound method [230] in Theorem 9.4.1 has been improved
upon by Paterson (unpublished) and Zwick [373]. Paul [244] has applied Neciporuk’s
method to show that the indirect storage access function has formula size $\Omega(n^2/\log n)$ over
the basis $B_2$. Neciporuk’s method has also been applied to many other problems, including the
determinant [169], the marriage problem [126], recognition of context-free languages [241],
and the clique function [304].

The proof of Krapchenko’s lower bound [174] given in Theorem 9.4.2 is due to Paterson,
as described by Bopanna and Sipser [50]. Koutsoupias [172] has obtained the results of
Problems 9.30 and 9.31, improving upon the Krapchenko lower bounds for the $k$th threshold
function by a factor of at least 2. Andreev [24], building on the work of Subbotovskaya
[320], has improved upon Krapchenko’s method and exhibits a lower bound of $\Omega(n^{2.5-\epsilon})$ on
a function of $n$ variables for every fixed $\epsilon > 0$ when $n$ is sufficiently large. Krichevskii [176]
has shown that over the standard basis, $\tau_t^{(n)}$ requires formula size $\Omega(n \log n)$, which beats
Krapchenko’s lower bound for small and large values of $t$.

Symmetric functions are examined in Section 2.11 and upper bounds are given on the
circuit size of such functions over the basis $\{\land, \lor, \oplus\}$. Polynomial-size formulas for symmetric
functions are implicit in the work of Ofman [234] and Wallace [356], who also independently
demonstrated how to add two binary numbers in logarithmic depth. Krapchenko [175]
demonstrated that all symmetric Boolean functions have formula size $O(n^{4.93})$ over the standard
basis. Peterson [247], improving upon the results of Pippenger [248] and Paterson [241],
showed that all symmetric functions have formula size $O(n^{3.27})$ over the basis $B_2$. Paterson,
Pippenger, and Zwick [242,243] have recently improved these results, showing that over $B_2$
and $U_2$ formulas exist of size $O(n^{3.13})$ and $O(n^{4.57})$, respectively, for many symmetric Boolean
functions including the majority function, and of size $O(n^{3.30})$ and $O(n^{4.85})$, respectively, for
all symmetric Boolean functions.

Markov demonstrated that the minimal number of negations needed to realize an arbitrary
binary function on $n$ variables with an arbitrary number of output variables, maximized over
all such functions, is at most $\lceil \log_2(n + 1) \rceil$. For Boolean functions (they have one output
variable) it is at most $\lfloor \log_2(n + 1) \rfloor$. Fischer [100] has described a circuit whose size is at most
twice that of an optimal circuit plus the size of a circuit that computes $f_{\mathrm{neg}}(x_1, \ldots, x_n) =
(\bar{x}_1, \ldots, \bar{x}_n)$ and whose depth is at most that of the optimal circuit plus the depth of a circuit
for $f_{\mathrm{neg}}$. He exhibits a circuit for $f_{\mathrm{neg}}$ of size $O(n^2 \log n)$ and depth $O(\log n)$. This is
the result given in Theorem 9.5.1. Tanaka and Nishino [323] have improved the size bound
on $f_{\mathrm{neg}}$ to $O(n \log^2 n)$ at the expense of increasing the depth bound to $O(\log^2 n)$. Beals,
Nishino, and Tanaka [32] have further improved these results, deriving simultaneous size and
depth bounds of $O(n \log n)$ and $O(\log n)$, respectively.

Using non-constructive methods, a series of upper bounds have been developed on the
monotone formula size of the threshold functions by Valiant [346] and Bopanna [49],
culminating in bounds by Khasin [166] and Friedman [106] of $O(t^{4.3} n \log n)$ over the monotone
basis. With constructive methods, Ajtai, Komlos, and Szemeredi [14] obtained polynomial
bounds on the formula size of $\tau_t^{(n)}$ over the monotone basis. Using their construction, Friedman
[106] has obtained a bound on formula size over the monotone basis of $O(t^c n \log n)$ for
$c$ a large constant.

Over the basis $B_2$, Fischer, Meyer, and Paterson [101] have shown that the majority function
$\tau_t^{(n)}$, $t = \lceil n/2 \rceil$, and other symmetric functions require formula size $\Omega(n \log n)$. Pudlak
[264], building on the work of Hodes and Specker [136], has shown that all but 16 symmetric
Boolean functions on $n$ variables require formula size $\Omega(n \log \log n)$ over the same basis. The
16 exceptional functions have linear formula size.

Using counting arguments such as those given in Section 2.12, Gilbert [114] has shown
that most monotone Boolean functions on $n$ variables have a circuit size that is $\Omega(2^n/n^{3/2})$.
Red’kin [275] has shown that the lower bound can be achieved to within a constant multiplicative
factor by every monotone Boolean function.

Tiekenheinrich [330] gave a $4n$ lower bound to the monotone circuit size of a simple
function. Dunne [87] derived a $3.5n$ lower bound on the monotone circuit size for the majority
function.

The lower bound on the monotone circuit size of binary sorting (Theorem 9.6.1) is due
to Lamagna and Savage [188] using an argument patterned after that of Van Voorhis [351] for
comparator-based sorting networks. Muller and Preparata [225,226] demonstrate that binary
sorting over the standard basis has circuit size $O(n)$. (See Theorem 2.11.1.) Pippenger and
Valiant [253] and Lamagna [187] demonstrate an $\Omega(n \log n)$ lower bound on the monotone
circuit size of merging. These results are established in Section 9.6.1. The sorting network
designed by Ajtai, Komlos, and Szemeredi [14] when specialized to Boolean data yields a
monotone circuit of size $O(n \log n)$ for binary sorting.

The first proof that the monotone circuit size of $n \times n$ Boolean matrix multiplication
(see Section 9.6.2) is $\Omega(n^3)$ was obtained by Pratt [256]. Later Paterson [238] and Mehlhorn
and Galil [218] demonstrated that it is exactly $n^2(2n - 1)$. Weiss [361] discovered a simple
application of the function-replacement method to both Boolean convolution and Boolean
matrix multiplication, as summarized in Corollary 9.6.1 and Theorem 9.6.5. (Wegener [360,
p. 170] extended Weiss’s result to include the number of ORs.) Wegener [357] has exhibited an
n-input, n-output Boolean function (Boolean direct product) whose monotone circuit size is
$\Omega(n^2)$. Earlier, several authors examined the class of multi-output functions known as Boolean
sums in which each output is the OR of a subset of inputs. Neciporuk [231] gave an explicit
set of Boolean sums and demonstrated that its monotone circuit size is $\Omega(n^{3/2})$. This lower
bound for such functions was independently improved to $\Omega(n^{5/3})$ by Mehlhorn [216] and
Pippenger [250]. More recently, Andreev [23] has constructed a family of Boolean sums with
monotone circuit size that is $\Omega(n^{2-\epsilon})$ for every fixed $\epsilon > 0$.

The first super-polynomial lower bound on the monotone circuit size of the clique function
was established by Razborov [270]. Shortly afterward, Andreev [22], using similar methods,
gave an exponential lower bound on the monotone circuit size of a problem in NP. Because the
clique function is complete with respect to monotone projections [310,344], this established
an exponential lower bound for the clique function. Alon and Bopanna [17], by strengthening
Razborov’s method, gave a direct proof of this fact, giving a lower bound exponential in
$\Omega((n/\log n)^{1/3})$. The stronger lower bound given in Theorem 9.6.6, which is exponential
in $\Omega(n^{1/3})$, is due to Amano and Maruoka [20]. They apply bottleneck counting, an idea of
Haken [125], to establish this result. Amano and Maruoka [20] have also extended the approximation
method to circuits that have negations only on their inputs and for which the number
of inputs carrying negations is small. They show that, even with a small number of negations,
an exponential lower bound on the circuit size of the clique function can be obtained.

Having  shown  that  monotone  circuit  complexity  can  lead  to  exponential  lower  bounds, 
Razborov  [271]  then  cast  doubt  on  the  likelihood  that  this  approach  would  lead  to  exponential 
non-monotone  circuit  size  bounds  by  proving  that  the  matching  problem  on  bipartite  graphs, 
a problem in P, has a super-polynomial monotone circuit size. Tardos [324] strengthened
Razborov’s lower bound, deriving an exponential one. Later Razborov [273] demonstrated
that the obvious generalization of the approximation method cannot yield better lower bounds
than $\Omega(n^2)$ for Boolean functions on n inputs realized by circuits over complete bases.

Berkowitz  [37]  introduced  the  concept  of  pseudo-inverse  and  established  Theorem  9.6.9. 
Valiant [347], Wegener [358], and Paterson (unpublished; see [92,360]) independently improved
upon the size of the monotone circuit realizing all pseudo-negations from $O(n^2 \log n)$
to $O(n \log^2 n)$ to produce Theorem 9.6.8. Lemma 9.6.9 is due to Dunne [90].

In  his  Ph.D.  thesis  Dunne  [88]  has  given  the  most  general  definition  of  pseudo-negation. 
He shows that a Boolean function h is a pseudo-negation on variable $x_i$ of a Boolean function
f on the n variables $x_1, \ldots, x_n$ if and only if h satisfies

$$f(x)|_{x_i=0} \le h(x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n) \le f(x)|_{x_i=1}$$

Here $f(x)|_{x_i=a}$ denotes the function obtained from f by fixing $x_i$ at a.

Dunne [89] demonstrated that HALF-CLIQUE CENTRAL SLICE is NP-complete (Theorem 9.6.10)
and showed that the central slices of HAMILTONIAN CIRCUIT (there is a
closed path containing each vertex once) and SATISFIABILITY are NP-complete. As mentioned
by Dunne [91], not all NP-complete problems have NP-complete central slices.

The  concept  of  communication  complexity  arose  in  the  context  of  the  VLSI  model  of 
computation  discussed  in  Chapter  12.  In  this  case  it  measures  the  amount  of  information  that 
must  be  transmitted  from  the  inputs  to  the  outputs  of  a  function.  The  communication  game 
described  in  Section  9.7.1  is  different:  it  characterizes  a  search  problem  because  its  goal  is  to 
find  an  input  variable  on  which  two  n-tuples  in  disjoint  sets  disagree. 

Yao  [366]  developed  a  method  to  derive  lower  bounds  on  the  communication  complexity 
of functions $f : X \times Y \mapsto Z$. He considered the matrix of values of f where the rows
and columns are indexed by the values of X and Y. He defined monochromatic rectangles
as submatrices in which all entries are the same. He then established that the logarithm of
the minimal number of disjoint rectangles in this matrix is a lower bound on the number of
bits that must be exchanged to compute f. (This result shows, for example, that the identity
function $f : \mathcal{B}^{2n} \mapsto \mathcal{B}$ defined by f(x, y) = 1 if and only if $x_i = y_i$ for all $1 \le i \le n$
requires the exchange of at least n + 1 bits.) Savage [288] adapted the crossing sequence
argument from one-tape Turing machines (an application of the pigeonhole principle) to derive
lower bounds on predicates. Mehlhorn and Schmidt [220] show that functions $f : X \times Y \mapsto Z$
for which Z is a subset of a field have a communication complexity that is at most the rank
of the two-dimensional matrix of values of f.

The development of the relationship between the circuit depth of a function and its communication
complexity follows that given by Karchmer and Wigderson [157]. Karchmer [156]
cites Yannakakis for independently discovering the connection $D_{\Omega_0}(f) = C(f^{-1}(0), f^{-1}(1))$
of Theorem 9.7.1 for non-monotone functions. Karchmer and Wigderson [157] have examined
st-connectivity in this framework. This is the problem of determining from the adjacency
matrix of an undirected graph G with n vertices and two distinguished vertices, s and
t, whether there is a path from s to t. When characterized as a Boolean function on the edge
variables, this is a monotone function. Karchmer and Wigderson [157] have shown that the
circuit depth of this function is $\Omega((\log n)^2 / \log \log n)$, a result later improved to $\Omega((\log n)^2)$
independently by Håstad and Boppana in unpublished work. Raz and Wigderson [269] have
shown via a complex proof that the clique problem on n-vertex graphs studied in Section 9.7.4
has monotone communication complexity and depth $\Omega(n)$. The simpler but weaker lower
bound for this problem developed in Section 9.7.4 is due to Goldmann and Håstad [116].

Furst, Saxe, and Sipser [107] and, independently, Ajtai [13] obtained the first strong lower
bounds on the size of bounded-depth circuits. They demonstrated that every bounded-depth
circuit for the parity function $f_\oplus$ has superpolynomial size. Using a deeper analysis, Yao
[368] demonstrated that bounded-depth circuits for $f_\oplus$ have exponential size. Håstad [124]
strengthened the results and simplified the argument, giving a lower bound on circuit size of
$2^{\Omega(n^{1/d}/10)}$ for circuits of depth d.

Razborov [272] examined a more powerful class of bounded-depth circuits, namely, circuits
that use unbounded fan-in AND, OR, and parity functions. He demonstrated that the
majority function has exponential size over this larger basis. Smolensky [313] simplified
and strengthened Razborov’s result, obtaining an exponential lower bound on the size of a
bounded-depth circuit for the $\mathrm{MOD}_p$ function over the basis AND, OR, and $\mathrm{MOD}_q$ when p
and q are distinct powers of primes. We use a simplified version of his result in Section 9.7.5.


Space-Time Tradeoffs


An  important  question  in  the  study  of  computation  is  how  best  to  use  the  registers  of  a  CPU 
and/or  the  random-access  memory  of  a  general-purpose  computer.  In  most  computations,  the 
number  of  registers  (space)  available  is  insufficient  to  hold  all  the  data  on  which  a  program 
operates  and  registers  must  be  reused.  If  the  space  is  increased,  the  number  of  computation 
steps  (time)  can  generally  be  reduced.  This  is  an  example  of  a  space-versus-time  tradeoff.  In 
this  chapter  we  examine  tradeoffs  between  the  number  of  storage  locations  and  computation 
time  using  the  pebble  game  and  the  branching  program  model. 

The  pebble  game  assumes  that  computations  are  done  with  straight-line  programs  in  a 
data-independent  fashion.  Each  such  program  is  modeled  by  a  directed  acyclic  graph.  A 
pebble  on  a  vertex  indicates  that  its  value  is  in  a  register.  The  goal  of  the  game  is  to  pebble  the 
output  vertices  of  the  graph  with  numbers  of  pebbles  (space)  and  steps  (time)  that  are  minimal, 
that  is,  neither  can  be  reduced  without  increasing  the  other. 

A branching program models data-dependent computation under the assumption that input
variables assume a bounded number of values. Such a program is defined by a directed
acyclic  multigraph  (there  may  be  more  than  one  edge  between  vertices)  that  specifies  the  order 
in  which  inputs  are  read.  Time  is  the  length  of  the  longest  path  in  a  multigraph  and  space  is 
the  logarithm  of  its  number  of  vertices. 

For  both  models  we  present  techniques  to  derive  lower  bounds  on  the  exchange  of  space  S 
for time T. For most problems examined here these exchanges are of the form $ST = \Omega(n^2)$,
where  n  is  the  size  of  the  problem  input.  Upper  bounds  on  ST  are  obtained  by  evaluating  S 
and  T  for  particular  algorithms. 

Because  the  branching  program  is  more  general  than  the  pebble  game,  it  is  more  difficult 
to  obtain  good  lower  bounds  with  it,  and  for  this  reason  we  begin  with  the  pebble  game.  In 
addition, the pebble game is appropriate for problems such as integer multiplication, convolution,
and matrix multiplication on which only straight-line programs are used. For other
problems,  such  as  merging  and  sorting,  the  algorithms  used  typically  involve  branching  and 
for  them  the  branching  program  is  the  better  model. 

We  also  exhibit  extreme  results  for  the  pebble  game  by  showing  that  the  time  to  pebble 
some  graphs  goes  from  minimal  to  exponential  in  the  size  of  the  graphs  when  the  number 
of  pebbles  changes  by  1,  a  warning  against  trying  too  hard  to  minimize  the  number  of  CPU 
registers used in a computation.

10.1 The Pebble Game

The  pebble  game  is  a  game  played  on  directed  acyclic  graphs  (DAGs),  which  capture  the 
dependencies  of  straight-line  programs  studied  in  Chapters  2  and  6.  Algorithms  for  many 
important  problems,  such  as  the  FFT  and  matrix  multiplication,  are  naturally  computed  by 
straight-line  programs.  In  the  pebble  game  pebbles  are  placed  on  vertices  of  a  DAG  to  indicate 
that  the  value  associated  with  a  vertex  resides  in  a  register.  Pebbles  are  placed  on  vertices  in  a 
data-independent  order. 

In  this  game  a  pebble  can  be  placed  on  an  input  vertex  at  any  time  and  on  any  non-input 
vertex  whose  immediate  predecessor  vertices  carry  pebbles.  The  goal  of  the  game  is  to  place 
pebbles  on  each  output  vertex.  A  pebble  can  be  removed  from  a  vertex,  including  an  output 
vertex,  at  any  time  after  it  has  been  pebbled.  These  rules  are  summarized  below. 

The  rules  of  the  pebble  game  are  the  following: 

•  (Initialization)  A  pebble  can  be  placed  on  an  input  vertex  at  any  time. 

•  (Computation  Step)  A  pebble  can  be  placed  on  (or  moved  to)  any  non-input  vertex  only 
if  all  its  immediate  predecessors  carry  pebbles. 

•  (Pebble  Deletion)  A  pebble  can  be  removed  at  any  time. 

•  (Goal)  Each  output  vertex  must  be  pebbled  at  least  once. 

Placement  of  a  pebble  on  an  input  vertex  models  the  reading  of  input  data.  Placement  of 
a  pebble  on  a  non-input  vertex  corresponds  to  computing  the  value  associated  with  the  vertex. 
The  removal  of  a  pebble  models  the  erasure  or  overwriting  of  the  value  associated  with  the 
vertex  on  which  the  pebble  resides. 

Allowing  pebbles  to  be  placed  on  input  vertices  at  any  time  reflects  the  assumption  that 
inputs  are  readily  available.  (The  multi-level  pebble  game  introduced  in  the  next  chapter 
models  the  case  in  which  each  access  to  secondary  storage  is  expensive.)  The  condition  that 
all  predecessor  vertices  carry  pebbles  when  a  pebble  is  placed  on  a  vertex  models  the  natural 
requirement  that  an  operation  can  be  performed  only  after  all  arguments  of  the  operation 
are  located  in  main  memory.  Moving  (or  sliding)  a  pebble  to  a  vertex  from  an  immediate 
predecessor  reflects  the  design  of  CPUs  that  allow  the  result  of  a  computation  to  be  placed  in 
a  memory  location  holding  an  operand. 

A  pebbling  strategy  is  the  execution  of  the  rules  of  the  pebble  game  on  the  vertices  of  a 
graph.  We  assign  a  step  to  each  placement  of  a  pebble,  ignoring  steps  on  which  pebbles  are 
removed,  and  number  the  steps  consecutively  from  1  to  T,  the  time  or  number  of  steps  in 
the  strategy.  The  space,  S,  used  by  a  pebbling  strategy  is  the  maximum  number  of  pebbles 
it  uses.  The  goal  of  the  pebble  game  is  to  pebble  a  graph  with  values  of  space  and  time  that 
are  minimal;  that  is,  the  space  cannot  be  reduced  for  the  given  value  of  time  and  vice  versa. 
In  general,  it  is  not  possible  to  minimize  space  and  time  simultaneously.  We  derive  upper  and 
lower  bounds  on  the  possible  exchanges  of  space  for  time. 
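The rules and the two cost measures can be made concrete with a small checker. The following is a minimal sketch, not a construction from the text: the function name `play`, the move encoding (`"place"`, `"slide"`, `"remove"`), and the dictionary representation of the DAG are all assumptions chosen for illustration. It replays a strategy, rejects illegal moves, and reports the space S (maximum number of pebbles on the graph) and the time T (number of placements, removals not counted).

```python
from typing import Dict, List, Sequence, Set, Tuple

Move = Tuple[str, ...]  # ("place", v), ("slide", u, v) or ("remove", v)


def play(preds: Dict[str, List[str]], outputs: Sequence[str],
         moves: Sequence[Move]) -> Tuple[int, int]:
    """Replay a pebbling strategy; return (space S, time T) or raise ValueError.

    preds maps each vertex to its immediate predecessors; an input vertex
    maps to an empty list.
    """
    pebbled: Set[str] = set()
    reached: Set[str] = set()
    space = time = 0
    for move in moves:
        kind, target = move[0], move[-1]
        if kind == "remove":
            if target not in pebbled:
                raise ValueError(f"no pebble to remove in {move}")
            pebbled.remove(target)
            continue                      # removals are not counted as steps
        # Placement on an input is always legal (empty predecessor list);
        # any other vertex needs pebbles on all immediate predecessors.
        if not all(p in pebbled for p in preds[target]):
            raise ValueError(f"predecessors not pebbled in {move}")
        if kind == "slide":
            source = move[1]
            if source not in preds[target]:
                raise ValueError(f"cannot slide along a non-edge in {move}")
            pebbled.remove(source)        # the pebble moves; none is added
        pebbled.add(target)
        time += 1
        space = max(space, len(pebbled))
        if target in outputs:
            reached.add(target)
    if reached != set(outputs):
        raise ValueError("some output vertex was never pebbled")
    return space, time


# A tree with two inputs a, b feeding one root r.
tree = {"a": [], "b": [], "r": ["a", "b"]}
with_slide = [("place", "a"), ("place", "b"), ("slide", "a", "r")]
without_slide = [("place", "a"), ("place", "b"), ("place", "r")]
print(play(tree, ["r"], with_slide))     # (2, 3)
print(play(tree, ["r"], without_slide))  # (3, 3)
```

The two runs differ only in whether the last pebble is slid from a predecessor or placed fresh, which is why sliding saves one pebble without changing the step count.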

10.1.1  The  Pebble  Game  Versus  the  Branching  Program 

As stated above, the branching program model introduced in Section 10.9 handles data-dependent
computation, and is thus a more general model than the pebble game. However,
there are three reasons to study the pebble game. First, the branching program assumes that
input variables are held in an auxiliary random-access machine so that it can access them in
arbitrary order, a condition not imposed on pebble games. It follows that inputs to a pebble
game can be fetched in advance, since the times at which they are needed are data-independent.
Second, lower bounds on the exchange of space for time with branching programs are harder to
obtain due to their increased flexibility. Third, straight-line programs are used in many problems,
such as integer multiplication, convolution, matrix multiplication, and discrete Fourier
transform, and the pebble game gives the relevant lower bounds. For other problems, such as
sorting and merging, the branching program model is the model of choice since these problems
are typically solved with branching programs. We expand upon this topic in Section 10.9.1.

Figure 10.1 An FFT graph $F^{(3)}$ on $n = 2^3$ inputs. Input vertices are on the bottom; edges are
directed upward. Four pebbles are shown on the graph when pebbling the leftmost output.

10.1.2  Playing  the  Pebble  Game 

The pebble game is illustrated in Fig. 10.1 by pebbling the FFT graph $F^{(3)}$ with eight inputs
and  24  non-input  vertices.  This  graph  has  the  property  that  the  set  of  paths  from  input  vertices 
to  an  output  vertex  forms  a  complete  balanced  binary  tree.  (See  Fig.  10.2.)  It  follows  that  we 
can  pebble  the  FFT  graph  by  pebbling  each  of  the  trees.  Since  two  of  the  eight  outputs  share 
the  same  tree  at  the  next  lower  level,  we  can  pebble  two  outputs  at  the  same  time. 

Binary  trees  form  an  important  class  of  graphs.  A  complete  balanced  binary  tree  of  depth 
4  is  illustrated  in  Fig.  10.2.  (The  depth  of  a  directed  tree  is  the  number  of  edges  on  the  longest 
path  from  an  input  vertex  to  the  output  (or  root)  vertex.)  This  tree  has  16  input  vertices  and 
one output vertex. A complete balanced binary tree of depth 0, T(0), consists of a single
vertex. A complete balanced binary tree of depth d > 0, T(d), consists of a root vertex and
two copies of T(d - 1) whose root vertices each have one edge directed from them to the
root vertex of the full tree. Thus in Fig. 10.2 the complete balanced binary tree of depth four
T(4) is constructed of two copies of T(3), which in turn are each constructed of two copies of
T(2), and so on. It follows by straightforward induction that a complete balanced binary tree
of depth d has $2^d$ inputs and $2^{d+1} - 1$ vertices. (See Problem 10.8.)

Figure 10.2 A complete balanced binary tree T(4) of depth 4 on 16 inputs. At least five
pebbles are needed to pebble it.


The  binary  tree  of  Fig.  10.2  can  be  pebbled  with  five  pebbles  by  pebbling  the  vertices  in 
the  order  shown.  Five  pebbles  are  needed  at  the  time  when  vertex  27  is  pebbled.  After  one 
pebble  is  moved  to  vertex  30,  the  two  outputs  of  the  FFT  of  Fig.  10.1  to  which  vertices  15  and 
30  are  attached  can  be  pebbled.  This  tree-pebbling  strategy  can  be  repeated  on  all  remaining 
outputs.  It  is  a  general  strategy  for  pebbling  complete  balanced  binary  trees. 

This pebbling strategy, explained in detail in the next section, demonstrates that an FFT
graph on $n = 2^k$ inputs can be pebbled with no more pebbles than are needed to pebble the
trees with n leaves contained within it, namely, k + 1. In the next section we show that this
is the minimum number of pebbles needed to pebble a complete balanced binary tree on $2^k$
leaves. This FFT pebbling strategy for the graph in Fig. 10.1 pebbles each vertex on the third
and fourth levels once, each vertex on the second level twice, and each vertex on the first level
four times. It is clear that inputs must be repebbled if the minimum number of pebbles is used.
This is an example of a space-time tradeoff. We shall derive a lower bound on the exchange of
space for time for this problem.

In  the  next  section  we  also  examine  the  minimum  space  required  to  pebble  graphs.  In  the 
subsequent  section  we  describe  a  graph  that  exhibits  an  extreme  tradeoff.  This  graph  requires 
a  pebbling  time  exponential  in  the  size  of  the  graph  when  the  minimum  number  of  pebbles  is 
used  but  can  be  pebbled  with  one  move  per  vertex  if  one  more  pebble  is  available. 

After  studying  extreme  tradeoffs  we  define  a  flow  property  of  functions  that,  if  satisfied, 
implies a lower bound on the product $(S + 1)T$ (or a related expression) involving the space S
and  time  T  needed  to  compute  such  functions.  This  test  is  used  to  show  that  many  standard 
algorithms  are  optimal  with  respect  to  their  use  of  space  and  time. 

10.2  Space  Lower  Bounds 

In this section we derive lower bounds on the minimum space $S_{\min}(G)$ needed to pebble a
graph G for balanced binary trees, pyramids, and FFT graphs, a representative set of graphs.
Any pebbling strategy will need to use at least as many pebbles as this minimum value of space.
It can be shown that no bounded-degree graph on n vertices requires more than $O(n/\log n)$
space (see Theorem 10.7.1) and that some graph requires space proportional to $n/\log n$ (see
Theorem 10.8.1).

Complete  balanced  binary  trees  were  introduced  in  the  previous  section.  We  now  derive  a 
lower  bound  on  the  space  (number  of  pebbles)  needed  to  pebble  them. 

LEMMA 10.2.1 Any pebbling strategy for the complete balanced binary tree of depth k, T(k),
requires at least $S_{\min}(T(k)) = k + 1$ pebbles and $2^{k+1} - 1$ steps. There is a pebbling strategy of
T(k) that uses exactly this many pebbles and steps.

Proof  Proof  of  the  lemma  requires  a  proof  that  k  +  1  pebbles  are  necessary  as  well  as  a 
strategy  that  pebbles  the  tree  with  k  +  1  pebbles  and  makes  one  pebble  placement  per 
vertex.  Let’s  first  develop  a  pebbling  strategy. 

T(0) obviously can be pebbled with one pebble in one step. Assume that T(k - 1) can
be pebbled with k pebbles in $2^k - 1$ steps. To pebble T(k), advance a pebble to the root of
its left subtree (a copy of T(k - 1)) using k pebbles and $2^k - 1$ steps. Leave a pebble on its
root. Then pebble the right subtree of T(k) using k additional pebbles and $2^k - 1$ steps, so that
at most k + 1 pebbles are on the tree, and move a pebble to the root of T(k) in one more step.
(A snapshot of T(k) when the number of pebbles is maximal under this pebbling strategy is
shown in Fig. 10.2.) Thus, T(k) is pebbled in $2 \times (2^k - 1) + 1 = 2^{k+1} - 1$ steps with k + 1 pebbles.

The  lower  bound  is  derived  by  showing  that  no  pebbling  strategy  can  use  fewer  than 
k  +  1  pebbles.  The  argument  used  is  the  following:  initially  no  path  to  the  root  of  the  tree 
(or  output)  from  input  vertices  carries  a  pebble  because  there  are  no  pebbles  on  the  graph. 
At  the  end  of  the  computation  a  pebble  resides  on  the  root  and  all  paths  to  the  root  carry 
pebbles.  Therefore,  there  must  be  a  first  point  in  time  at  which  there  is  a  pebble  on  each 
path  to  the  root.  This  must  be  a  time  at  which  a  pebble  is  placed  on  an  input  vertex,  thereby 
closing  the  last  path  from  that  input  to  the  root.  Such  a  path  is  highlighted  in  Fig.  10.2. 
Before  a  pebble  is  placed  on  the  input  vertex  of  this  path,  all  other  paths  from  input  vertices 
to the root carry pebbles. Each of these paths enters the highlighted path via one edge, and
none of their pebbles lies on the highlighted path itself, since that path does not yet carry a
pebble. Thus, prior to the placement of this last pebble there is at least one pebble in each of
the k subtrees that feed the k non-input vertices of the highlighted path. Consequently, at least
k + 1 pebbles are on the tree when the last pebble is placed on it. ■
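The upper-bound half of the lemma can be checked mechanically. The sketch below is an illustration under stated assumptions rather than the text's own procedure: vertices are named `(level, index)` with level 0 holding the inputs, and the helper names `pebble_tree`, `place`, and `advance` are hypothetical. It carries out the recursive strategy (left subtree, then right subtree while the left root keeps its pebble, then a slide to the parent) and records the maximum number of pebbles and the number of placements.

```python
from typing import Set, Tuple

Vertex = Tuple[int, int]  # (level, index); level 0 holds the inputs


def pebble_tree(depth: int) -> Tuple[int, int]:
    """Run the recursive strategy on T(depth); return (space, time)."""
    pebbled: Set[Vertex] = set()
    stats = {"space": 0, "time": 0}

    def place(v: Vertex) -> None:
        pebbled.add(v)
        stats["time"] += 1
        stats["space"] = max(stats["space"], len(pebbled))

    def advance(level: int, index: int) -> None:
        """Leave exactly one new pebble on vertex (level, index)."""
        if level == 0:
            place((0, index))
            return
        left, right = (level - 1, 2 * index), (level - 1, 2 * index + 1)
        advance(*left)            # left root keeps its pebble ...
        advance(*right)           # ... while the right subtree is pebbled
        pebbled.remove(left)      # slide the left pebble up to the parent
        place((level, index))
        pebbled.remove(right)     # the right pebble is no longer needed

    advance(depth, 0)
    return stats["space"], stats["time"]


print(pebble_tree(4))  # (5, 31)
```

For T(4) this reports five pebbles and 31 placements, matching k + 1 and $2^{k+1} - 1$ with k = 4.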

The FFT graph on $2^k$ inputs, $F^{(k)}$, is defined recursively in terms of two sub-FFT graphs
$F^{(k-1)}$, as shown in Section 6.7.2. It follows that this graph contains many copies of the tree
T(k) as a subgraph (see Problem 10.11) and that any pebbling strategy for $F^{(k)}$ requires at
least k + 1 pebbles. Many other straight-line computations involve tree computations.

A pyramid graph on m inputs, P(m) (P(6) is shown in Fig. 10.3), is obtained by slicing
an $m \times m$ mesh into two parts along its diagonal, splitting all diagonal nodes (which are now
inputs), and then directing edges from the diagonal vertices in one part to the one remaining
unsplit corner vertex in this part of the graph. Edges are directed up, a convention we use
throughout this chapter. P(m) has n = m(m + 1)/2 vertices. (See Problem 10.1.)

We  apply  to  the  pyramid  graph  P(m)  the  lower  bounding  argument  used  in  the  preceding 
proof  based  on  closing  the  last  open  path  to  the  output  vertex. 

LEMMA 10.2.2 Any pebbling strategy for the m-input, n-vertex (n = m(m + 1)/2) pyramid
graph P(m) requires at least m pebbles; that is, a minimum space $S_{\min}(P(m)) = m \ge \sqrt{2n} - 1$.
There exists a pebbling strategy that pebbles P(m) with m pebbles using one pebble placement
per vertex.

Figure 10.3 The pyramid graph on six inputs.

Proof  The  lower-bound  proof  again  uses  the  fact  that  there  is  a  first  time  at  which  all  paths 
from  an  input  to  the  output  carry  pebbles.  Highlighted  in  Fig.  10.3  is  a  last  path  to  carry 
a pebble. Prior to the placement of this last pebble, all other paths to the output carry pebbles.
Thus, with the placement of the last pebble there must be at least as many pebbles on the
pyramid graph as there are vertices on a path from an input to the output, namely, m, and
$m \ge \sqrt{2n} - 1$. (See Problem 10.1.)

With  m  pebbles,  the  vertices  can  be  pebbled  in  levels  by  first  placing  pebbles  on  each  of 
the  m  inputs.  Pebbles  are  then  advanced  to  vertices  on  the  second  level  from  left  to  right, 
and  this  process  is  repeated  at  all  levels  to  complete  the  pebbling.  Each  vertex  is  pebbled 
once  with  this  strategy.  ■ 
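The level-by-level strategy is short enough to simulate directly. In this sketch the coordinates are an assumption made for illustration: vertex `(level, i)` has predecessors `(level - 1, i)` and `(level - 1, i + 1)`, level 0 holds the m inputs, and the function name `pebble_pyramid` is hypothetical. Each non-input vertex is pebbled by sliding the pebble from its left predecessor, so the pebble count never exceeds m.

```python
from typing import Set, Tuple


def pebble_pyramid(m: int) -> Tuple[int, int]:
    """Pebble P(m) level by level; return (space, time)."""
    # Place one pebble on each of the m inputs (m placements).
    pebbled: Set[Tuple[int, int]] = {(0, i) for i in range(m)}
    space, time = m, m
    for level in range(1, m):
        for i in range(m - level):        # left to right across the level
            left, right = (level - 1, i), (level - 1, i + 1)
            if left not in pebbled or right not in pebbled:
                raise RuntimeError("predecessor lost its pebble")
            pebbled.remove(left)          # slide the left pebble upward
            pebbled.add((level, i))
            time += 1
            space = max(space, len(pebbled))
        # The rightmost pebble of the lower level has no further use.
        pebbled.remove((level - 1, m - level))
    return space, time


print(pebble_pyramid(6))  # (6, 21)
```

For P(6) the run uses six pebbles and 21 placements, one per vertex, in agreement with n = m(m + 1)/2.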

In  general,  it  is  very  hard  to  determine  the  minimum  number  of  pebbles  needed  to  pebble 
a  graph.  In  terms  of  the  complexity  classes  introduced  in  Chapter  8,  we  model  this  problem  as 
a language consisting of strings each of which contains the description of a graph G = (V, E),
a vertex $v \in V$, and an integer S with the property that the vertex can be pebbled with S or
fewer  pebbles.  The  language  of  these  strings  is  PSPACE-complete  (see  Section  8.12). 

10.3  Extreme  Tradeoffs 

We  now  show  that  extreme  space-time  tradeoff  behavior  is  possible.  We  do  this  by  exhibiting  a 
family of graphs, $H_1, H_2, \ldots, H_k, \ldots$ (Fig. 10.4), that requires a number of steps exponential
in  the  size  of  the  graph  when  the  minimum  number  of  pebbles  is  used  but  only  one  step  per 
vertex  when  one  more  pebble  is  available.  This  illustrates  that  excessive  minimization  of  the 
number  of  registers  used  by  programs  can  be  harmful! 

$H_1$ has one input and one output vertex and an edge connecting them, as shown in
Fig. 10.4. For $k \ge 2$ the kth graph, $H_k$, has k + 1 output vertices and is constructed from
one copy of $H_{k-1}$, a tree (on the left) with k inputs, a two-level bipartite graph (on the top
right) with k inputs and k + 1 outputs, and a chain of k vertices that connects the tree to the
outputs of $H_{k-1}$ and the open vertex. (A bipartite graph is a graph in which the vertices are
partitioned into two sets and edges join vertices in different sets.)

We summarize our pebbling results for this family of graphs below. Here n! is the factorial
function with value $n! = n \cdot (n - 1) \cdot (n - 2) \cdots 2 \cdot 1$.

Figure 10.4 A family of graphs exhibiting an extreme tradeoff.


THEOREM 10.3.1 The graph $H_k$ has $N(k) = 2k^2 + 5k - 6$ vertices for $k \ge 2$. Any pebbling
strategy for the graph $H_k$ requires at least k pebbles, $k = \Theta(\sqrt{N(k)})$. Any strategy to pebble $H_k$
with k pebbles requires at least $(k + 1)!/2 = 2^{\Omega(\sqrt{N(k)} \log N(k))}$ steps, whereas there exists a
pebbling algorithm using k + 1 pebbles that pebbles each vertex of $H_k$ once.

Proof Consider a pebbling strategy that uses k + 1 pebbles to pebble $H_k$. For the case of
k = 1, $H_k$ can be completely pebbled with one move per vertex. This is also true for $H_2$
because we can move a pebble to the open vertex connected to the bipartite graph using two
pebbles, from which we can advance two of our three pebbles to the bottom layer of the
bipartite graph and have one additional pebble with which to pebble the output vertices.
Note that this pebbling strategy allows us to pebble output vertices of $H_2$ from left to right
with three pebbles.

Assume that we can pebble the outputs of $H_{k-1}$ from left to right with k pebbles without
pebbling any vertex more than once. Then to pebble $H_k$, advance a pebble to the root of
the tree on the left and then pebble the outputs of $H_{k-1}$ from left to right using k pebbles
while keeping one additional pebble on the chain. Advance this pebble along the chain until
it reaches the open vertex. At this point k pebbles can be advanced to the bottom row of
vertices in the bipartite graph and the remaining pebble used to pebble outputs from left to
right. This shows that our assumption holds.

The minimum number of pebbles needed to pebble $H_k$ is at least k because at least this
many are needed to pebble the tree on the left. To show that this value can be achieved, we
give a recursive pebbling strategy. Observe that $H_1$ can be pebbled with k = 1 pebbles. To
pebble $H_k$, assume that we can pebble any one output of $H_{k-1}$ with k - 1 pebbles. Advance
a pebble to the root of the left tree and then advance it along the chain by pebbling output
vertices of $H_{k-1}$ from left to right with k - 1 pebbles. Move a pebble to the open vertex
and then to all vertices on one side of the bipartite graph. Any one output vertex can now
be pebbled. However, doing so requires that one vertex on the bottom side of the bipartite
graph lose its pebble. Thus, no other output vertex can be pebbled without repebbling the
tree and all vertices of $H_{k-1}$.

As this pebbling strategy demonstrates, to pebble an output vertex, all k pebbles must
move to the bottom of the bipartite graph, thereby removing all pebbles from other vertices
of $H_k$. Let M(k) be the number of pebble placements to pebble $H_k$ with k pebbles. It
follows that to pebble each of the (k + 1) outputs of $H_k$ with k pebbles, we must pebble
each output of $H_{k-1}$ with k - 1 pebbles. Thus,

$$M(k) \ge (k + 1) \times M(k - 1) \ge (k + 1)\,k\,(k - 1) \cdots 3 \cdot 1 = (k + 1)!/2$$

which provides the desired lower bound.

Let the graph $H_k$ have N(k) vertices. Then N(1) = 2, N(2) = 12 and N(k) =
N(k - 1) + 4k + 3 for $k \ge 3$. A straightforward proof by induction shows that $N(k) =
2k^2 + 5k - 6$ for $k \ge 2$ (see Problem 10.13).

To show that $M(k) \ge (k + 1)!/2$ is exponential in $N(k) = 2k^2 + 5k - 6$, note that
$p! = p \cdot (p - 1) \cdots 3 \cdot 2 \cdot 1$, which is at least $(p/2)^{p/2}$ since each of the first p/2 terms is at
least p/2. Thus, $M(k) \ge .5\,[(k + 1)/2]^{(k+1)/2}$. Also, it is easy to see that $N(k) \le 3(k + 1)^2$
for $k \ge 1$. Since this implies $\sqrt{N(k)/3} \le (k + 1)$, we have that

$$M(k) \ge .5 \left[ \left(\sqrt{N(k)/3}\right)/2 \right]^{\left(\sqrt{N(k)/3}\right)/2}$$

which is exponential in N(k). ■
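The counting facts used in this proof are easy to confirm numerically. The following sketch (with hypothetical helper names `num_vertices` and `step_lower_bound`) evaluates the recurrence for N(k) against the closed form, unrolls the inequality M(k) ≥ (k + 1) M(k - 1) down to the trivial base M(1) ≥ 1, and checks the bound N(k) ≤ 3(k + 1)² for a range of k. It verifies the arithmetic only; it does not construct the graphs $H_k$.

```python
from math import factorial


def num_vertices(k: int) -> int:
    """N(k) from the recurrence N(1) = 2, N(2) = 12, N(k) = N(k-1) + 4k + 3."""
    if k == 1:
        return 2
    if k == 2:
        return 12
    return num_vertices(k - 1) + 4 * k + 3


def step_lower_bound(k: int) -> int:
    """Unroll M(k) >= (k + 1) * M(k - 1) down to the base M(1) >= 1."""
    bound = 1
    for j in range(2, k + 1):
        bound *= j + 1                    # factor contributed by H_j
    return bound


for k in range(2, 30):
    assert num_vertices(k) == 2 * k * k + 5 * k - 6      # closed form
    assert step_lower_bound(k) == factorial(k + 1) // 2  # (k + 1)!/2
    assert num_vertices(k) <= 3 * (k + 1) ** 2

print(num_vertices(5), step_lower_bound(5))  # 69 360
```

Already at k = 5 a graph with 69 vertices needs at least 360 placements when only five pebbles are allowed, while six pebbles suffice for one placement per vertex.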


Many vertices in the graph $H_k$ have a fan-in k. A new family $\{G_k\}$ of graphs with fan-in
2 can be obtained by replacing the tree on the left in $H_k$ with the pyramid graph of Fig. 10.3
and replacing the bipartite graph on the top with a new graph (see Problem 10.14). This new
graph exhibits an exponential jump in the time to pebble the graph but at a value of space that
is the fourth root of the number of vertices in $G_k$.


10.4  Grigoriev’s  Lower-Bound  Method 

In  this  section  we  present  a  method  for  developing  lower  bounds  on  the  exchange  of  space  for 
time in the pebble game. These lower bounds are typically of the form $(S + 1)T = \Omega(n^2)$,
where  S,  T,  and  n  are  the  space,  time,  and  the  size  of  the  input  to  the  problem,  and  are  similar 
in  spirit  to  those  of  Theorem  3.6.1.  Because  they  assume  a  less  general  model  of  computation 
(the  pebble  game  instead  of  the  RAM),  lower  bounds  are  easier  to  derive. 

The  lower  bounds  use  as  a  measure  the  maximum  amount  of  information  that  can  flow 
from  a  subset  of  the  inputs  to  a  subset  of  the  outputs,  and  are  much  easier  to  derive  than  are 
lower bounds on circuit size for the circuit model. Although the results are stated for straight-line computations, they apply to all “input-output-oblivious” computations by finite-state machines: computations in which inputs are read and outputs produced at times independent of the values of the input variables. (See Problem 10.20.)

10.4.1 Flow Properties of Functions

We start by defining a flow property of functions. (See Fig. 10.5.) A function $f : \mathcal{A}^n \mapsto \mathcal{A}^m$ has a large information flow from input variables in $X_1$ to output variables in $Y_1$ if there are values for input variables in $X_0 = X - X_1$ such that many different values can be assumed by outputs in $Y_1$ as inputs in $X_1$ range over all their $|\mathcal{A}|^{|X_1|}$ values. This flow property is also used in Section 12.7 to derive lower bounds on the exchange of area for time in the VLSI model of computation.

Figure 10.5 A function $f$ that has a large information flow from input variables in $X_1$ to output variables in $Y_1$ for some values of input variables in $X_0 = X - X_1$.

DEFINITION 10.4.1 A function $f : \mathcal{A}^n \mapsto \mathcal{A}^m$ has a $w(u,v)$-flow if for all subsets $X_1$ and $Y_1$ of its $n$ input and $m$ output variables, with $|X_1| \ge u$ and $|Y_1| \ge v$, there is a subfunction $h$ of $f$ obtained by making some assignment to variables of $f$ not in $X_1$ (variables in $X_0$) and discarding output variables not in $Y_1$ such that $h$ has at least $|\mathcal{A}|^{w(u,v)}$ points in the image of its domain.

The exponent function $w(u,v)$ is a nondecreasing function of both of its arguments: increasing $u$, the number of variables that are allowed to vary, can only increase the number of values assumed by $f$; the same is true if $v$ is increased.

An important class of functions are the $(\alpha, n, m, p)$-independent functions defined below.

DEFINITION 10.4.2 A function $f : \mathcal{A}^n \mapsto \mathcal{A}^m$ is an $(\alpha, n, m, p)$-independent function for $\alpha \ge 1$ and $p \le m$ if it has a $w(u,v)$-flow satisfying $w(u,v) > (v/\alpha) - 1$ for $n - u + v \le p$.

We illustrate the independence property of a function with matrix multiplication: we show that the function defined by the product of two $n \times n$ matrices is $(1, 2n^2, n^2, n)$-independent. In Section 10.5.4, we show that a stronger property holds for matrix multiplication.

The proof of the independence property of $n \times n$ matrices uses the permutation matrices described in Section 6.2. An $n \times n$ permutation matrix is obtained by permuting either the rows or columns of the $n \times n$ identity matrix. When a permutation matrix $B$ multiplies another matrix $A$ on the right (left) to produce $AB$ ($BA$), it permutes the columns (rows) of $A$.
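The effect of a permutation matrix on rows and columns is easy to check on a small case. The sketch below is illustrative only; the helper names `matmul` and `permutation_matrix` and the sample matrix are assumptions introduced here.

```python
def matmul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def permutation_matrix(perm):
    """Identity matrix with its rows permuted: row i is the unit vector e_perm[i]."""
    n = len(perm)
    return [[1 if j == perm[i] else 0 for j in range(n)] for i in range(n)]

A = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
P = permutation_matrix([2, 0, 1])

AP = matmul(A, P)   # multiplication on the right permutes the columns of A
PA = matmul(P, A)   # multiplication on the left permutes the rows of A
print(AP)
print(PA)
```

Every column of `AP` is a column of `A`, and every row of `PA` is a row of `A`; no entry is changed, only moved.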

LEMMA 10.4.1 The matrix multiplication function $f^{(n)}_{A\times B} : \mathcal{R}^{2n^2} \mapsto \mathcal{R}^{n^2}$ over the ring $\mathcal{R}$ is $(1, 2n^2, n^2, n)$-independent.
Proof Let $C = AB$ be the product of $n \times n$ matrices $A$ and $B$. Consider any set $X_0$ of input variables (entries of $A$ and $B$) and any set $Y_1$ of output variables (entries of $C$) such that $|X_0| + |Y_1| = n$. The outputs in $Y_1$ fall into at most $|Y_1|$ columns of $C$ and the inputs in $X_0$ fall into at most $|X_0|$ columns of $A$. It follows that at least $n - |X_0|$ columns of $A$ contain only variables in $X_1$. Fix the entries in $B$ so that it forms a permutation matrix that permutes the columns of $A$ containing only elements in $X_1$ onto columns of $C$ containing elements of $Y_1$. (We are free to make the best assignment of variables in $B$, whether in $X_0$ or $X_1$.) It follows that each output variable in $Y_1$ is assigned to an input variable of $A$ in $X_1$ by this permutation. Thus these output variables are free to assume $|\mathcal{R}|^{|Y_1|}$ different values.

Since this is more than $|\mathcal{R}|^{|Y_1|-1}$, it follows that $f^{(n)}_{A\times B}$ is $(1, 2n^2, n^2, n)$-independent. ■

As this result illustrates, for any set of $y_1$ outputs of the matrix multiplication function and any set of $x_0$ of its inputs satisfying $x_0 + y_1 \le p$, there is some assignment to these inputs such that there is a large flow of information from the complementary set of inputs, $X_1$, to any such set of $y_1$ outputs.

10.4.2  The  Lower-Bound  Method  in  the  Basic  Pebble  Game 

The  following  theorem  provides  a  lower  bound  on  the  exchange  of  space  for  time.  Its  proof 
uses  a  variant  of  the  pigeonhole  principle.  Since  the  pebbling  of  vertices  is  assumed  to  occur 
sequentially,  time  is  divided  into  intervals  in  which  the  number  of  output  vertices  pebbled,  b,  is 
chosen  to  be  a  small  multiple  of  the  number  of  pebbles,  S,  used  in  pebbling.  The  pigeonhole 
principle  is  used  to  show  that  a  large  number  of  inputs  must  be  pebbled  in  each  interval. 
In  particular,  we  show  that  if  the  number  of  inputs  pebbled  inside  an  interval  is  small,  the 
number  of  inputs  outside  the  interval  is  large  enough  that  there  is  a  large  flow  from  the  inputs 
outside  the  interval  to  the  outputs  inside  it.  However,  the  flow  cannot  be  any  larger  than  can 
be  supported  by  the  number,  S,  of  vertices  carrying  pebbles  just  before  the  interval.  Thus,  the 
number  of  input  variables  outside  the  interval  is  small,  which  implies  that  the  number  inside  is 
large.  That  is,  many  inputs  must  be  pebbled  within  each  interval.  Multiplying  by  the  number 
of  intervals  in  which  b  outputs  are  pebbled  provides  the  lower  bound. 

THEOREM 10.4.1 Let $f : \mathcal{A}^n \mapsto \mathcal{A}^m$ have a $w(u,v)$-flow and let it be realized by a straight-line program over a basis $\{h : \mathcal{A}^r \mapsto \mathcal{A}^s \mid r, s \ge 1\}$. For arbitrary $b \le m$, every pebbling of every DAG for $f$ requires space $S$ and time $T$ satisfying the inequality

$$T \ge \lfloor m/b \rfloor (n - d)$$

where $d$ is the largest integer such that $w(d,b) \le S$.

Proof Assume that $G = (V, E)$ is pebbled with $S \ge 1$ pebbles in $T \ge 1$ steps. Let $T_I \le T$ be the number of times that input vertices are pebbled. (This is generally more than the number of input variables.)

Given a pebbling of $G$ with $S$ pebbles, group the consecutive pebbling steps into intervals, the first $\lfloor m/b \rfloor$ of which contain $b$ pebbled outputs and one of which contains $m - b\lfloor m/b \rfloor$ pebbled outputs.

Consider an arbitrary interval $I$ in which $b$ outputs are pebbled. Let $Y_1$ be these outputs and let $x_0$ and $x_1$ be the number of inputs pebbled inside and outside the interval, respectively. By definition, there is an assignment to the $x_0$ inputs such that the $b = |Y_1|$ outputs have at least $|\mathcal{A}|^{w(x_1,b)}$ different values. If $w(x_1,b) > S$, the outputs $Y_1$ assume more values than can be taken by the $S$ pebbles in use just prior to the start of $I$. Because the values of variables in $Y_1$ are determined by the inputs pebbled in $I$, which are fixed, and the values under the $S$ pebbles, this contradicts the definition of $f$. It follows that $x_1$ can be no larger than $d$, where $d$ is the largest value such that $w(d,b) \le S$. Thus the number of inputs pebbled in $I$, $x_0$, satisfies $x_0 \ge (n - d)$.

Since there are $\lfloor m/b \rfloor$ intervals in which $b$ outputs are pebbled, the number of times that inputs are pebbled, $T_I$, is at least $\lfloor m/b \rfloor (n - d)$. ■
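The bound of Theorem 10.4.1 is mechanical to evaluate once a flow function is available. The sketch below is a hedged illustration: `grigoriev_time_bound` and `matmul_flow` are hypothetical names, and `matmul_flow` uses the lower bound on $w(u,v)$ that Lemma 10.5.3 states for matrix multiplication. Substituting a lower bound for the true $w$ can only enlarge $d$, so the value returned remains a valid, though possibly weaker, lower bound on $T$.

```python
def grigoriev_time_bound(n_inputs, m_outputs, b, space, w):
    """Evaluate floor(m/b) * (n - d), where d is the largest integer in
    [0, n] with w(d, b) <= space (hypothetical helper)."""
    if not 1 <= b <= m_outputs:
        raise ValueError("b must satisfy 1 <= b <= m")
    feasible = [d for d in range(n_inputs + 1) if w(d, b) <= space]
    if not feasible:
        raise ValueError("no d in [0, n] satisfies w(d, b) <= S")
    d = max(feasible)
    return (m_outputs // b) * (n_inputs - d)

def matmul_flow(n):
    """Lower bound on the flow of n x n matrix multiplication (Lemma 10.5.3)."""
    def w(u, v):
        return (v - (2 * n * n - u) ** 2 / (4 * n * n)) / 2
    return w

n = 4                                   # 4 x 4 matrices: 32 inputs, 16 outputs
bound = grigoriev_time_bound(2 * n * n, n * n, b=3, space=1, w=matmul_flow(n))
print(bound)
```

For $n = 4$, $S = 1$, and $b = 3S$, the largest feasible $d$ is 24, so each of the five full intervals must pebble at least eight inputs.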

Grigoriev [121] established the above theorem for $(1, n, m, p)$-independent functions. We restate as a corollary a slightly revised version of his theorem for $(\alpha, n, m, p)$-independent functions.

COROLLARY 10.4.1 Let $f : \mathcal{A}^n \mapsto \mathcal{A}^m$ be $(\alpha, n, m, p)$-independent and let it be realized by a straight-line program over a basis $\{h : \mathcal{A}^r \mapsto \mathcal{A}^s \mid r, s \ge 1\}$. Every pebbling of every DAG for $f$ requires space $S$ and time $T$ satisfying the inequality

$$\lceil \alpha(S+1) \rceil T \ge mp/4$$

Proof An $(\alpha, n, m, p)$-independent function on $n$ inputs has a $w(u,v)$-flow satisfying $w(u,v) > (v/\alpha) - 1$ for $n - u + v \le p$, where $x_0 = n - u \ge 0$. Since $b$ can be freely chosen, let $b = \lceil \alpha(S+1) \rceil$. Thus, $(b/\alpha) - 1 \ge S$ for $(n - d) + b \le p$, which contradicts the requirement that $w(d,b) \le S$. It follows that $(n - d) + b > p$ or that $(n - d) > p - \lceil \alpha(S+1) \rceil$. With the inequality $\lfloor m/x \rfloor \ge (m - x + 1)/x$ (see Problem 10.2), the following lower bound follows from Theorem 10.4.1:

$$T \ge \frac{(m - \lceil \alpha(S+1) \rceil + 1)(p - \lceil \alpha(S+1) \rceil)}{\lceil \alpha(S+1) \rceil}$$

Since $p \le m$, if $\lceil \alpha(S+1) \rceil \le p/2$, the desired lower bound follows. On the other hand, if $\lceil \alpha(S+1) \rceil > p/2$, $\lceil \alpha(S+1) \rceil T > mp/2$ since $T \ge m$. ■
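The first case at the end of this proof can be written out explicitly. The short derivation below is a sketch added for clarity; the shorthand $b$ for the ceiling term is the choice made in the proof.

```latex
\text{Assume } b = \lceil \alpha(S+1) \rceil \le p/2 \text{ and } p \le m. \text{ Then}
\[ m - b + 1 \;\ge\; m - p/2 \;\ge\; m/2, \qquad p - b \;\ge\; p/2, \]
\[ bT \;\ge\; (m - b + 1)(p - b) \;\ge\; \frac{m}{2}\cdot\frac{p}{2} \;=\; \frac{mp}{4}. \]
```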

It is possible that a function $f : \mathcal{A}^n \mapsto \mathcal{A}^m$ is not $(\alpha, n, m, p)$-independent but a subfunction $g : \mathcal{A}^r \mapsto \mathcal{A}^s$ is $(\alpha, r, s, p)$-independent for $r \le n$ and $s \le m$. (Subfunctions are defined in Section 2.4.) As shown in Problem 10.18, the lower bound for the subfunction $g$ applies to $f$.

Lower bounds on space-time exchanges can also be derived using properties of the graphs to be pebbled. For example, if a graph contains a superconcentrator (defined in Section 10.8), lower bounds on the product $(S+1)T$ can be derived in terms of the number of inputs of the graph. (See Problem 10.28.)

As mentioned at the beginning of this section, Theorem 10.4.1 is much more general than it appears. In Problem 10.20 the reader is asked to show that the lower bound holds for “input-output-oblivious” finite-state machines, FSMs that compute functions but read their inputs and produce their outputs at data-independent times. Problem 10.21 asks the reader to establish that pebblings of straight-line computations can be translated directly into computations by finite-state machines.
Figure 10.6 Pebbling an inner product graph with three pebbles.


10.4.3  First  Matrix  Multiplication  Bound 

The Grigoriev lower-bound method is well illustrated by matrix multiplication. We established its independence property in Section 10.4.1. In this section we apply Corollary 10.4.1 to it. The upper bound stated in the following theorem follows from the development of an algorithm for matrix multiplication that uses three pebbles and executes at most $4n^3$ steps. This algorithm, based on the standard matrix multiplication algorithm of Section 6.2.2, forms each of the $n^2$ inner products defined by the product of two $n \times n$ matrices using three pebbles, as suggested in Fig. 10.6, and $4n - 1$ steps.
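The three-pebble strategy for one inner product can be simulated to confirm the step count. This is a minimal sketch under stated assumptions: the vertex labels and the function `pebble_inner_product` are hypothetical, and the simulation assumes the pebble game allows a pebble to be moved from a predecessor onto the vertex being pebbled, which is what lets three pebbles suffice.

```python
def pebble_inner_product(n):
    """Simulate a three-pebble strategy on the inner product graph of two
    n-vectors. Returns (steps, maximum number of pebbles in use)."""
    pebbled = set()
    steps = 0
    max_pebbles = 0

    def place(v, preds=(), slide_from=None):
        nonlocal steps, max_pebbles
        # A non-input vertex may be pebbled only if all predecessors carry pebbles.
        assert all(p in pebbled for p in preds)
        if slide_from is not None:
            pebbled.discard(slide_from)   # move a pebble from a predecessor
        pebbled.add(v)
        steps += 1
        max_pebbles = max(max_pebbles, len(pebbled))

    acc = None                            # vertex holding the running sum
    for i in range(n):
        a, b, prod = ("a", i), ("b", i), ("prod", i)
        place(a)                          # input a_i
        place(b)                          # input b_i
        place(prod, preds=(a, b), slide_from=a)
        pebbled.discard(b)                # removing a pebble is not a step
        if acc is None:
            acc = prod
        else:
            total = ("sum", i)
            place(total, preds=(acc, prod), slide_from=acc)
            pebbled.discard(prod)
            acc = total
    return steps, max_pebbles

print(pebble_inner_product(4))
```

Each of the $2n$ inputs, $n$ products, and $n - 1$ sums is pebbled exactly once, which gives the $4n - 1$ steps quoted above.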

THEOREM 10.4.2 Every pebbling strategy for straight-line programs computing the matrix multiplication function $f^{(n)}_{A\times B} : \mathcal{R}^{2n^2} \mapsto \mathcal{R}^{n^2}$ for $n \times n$ matrices requires space $S$ and time $T$ satisfying the following inequality:

$$(S+1)T \ge n^3/4$$

The standard algorithm for multiplying $n \times n$ matrices uses space and time satisfying

$$(S+1)T = 16n^3$$
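Both bounds in Theorem 10.4.2 can be checked by direct substitution. The sketch below assumes only the parameters established earlier: the $(1, 2n^2, n^2, n)$-independence of Lemma 10.4.1 and the three-pebble inner product strategy with $4n - 1$ steps.

```latex
% Lower bound: Corollary 10.4.1 with \alpha = 1, m = n^2, p = n.
\[ \lceil 1\cdot(S+1) \rceil\, T = (S+1)T \;\ge\; \frac{mp}{4} = \frac{n^2 \cdot n}{4} = \frac{n^3}{4}. \]
% Upper bound: n^2 inner products, each pebbled with S = 3 pebbles in 4n - 1 steps.
\[ T = n^2(4n-1) \le 4n^3, \qquad (S+1)T \le 4 \cdot 4n^3 = 16n^3. \]
```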

Those  familiar  with  fast  non-standard  matrix  multiplication  algorithms  such  as  Strassen’s 
fast  matrix  algorithm  (Section  6.3)  may  find  this  result  surprising.  Whereas  one  learns  that 
the  standard  matrix  multiplication  algorithm  is  not  optimal  with  respect  to  computation  time, 
the  above  result  states  that  the  standard  matrix  multiplication  algorithm  is  nearly  optimal  with 
respect  to  the  space-time  product. 

In Section 10.5.4 we specialize Theorem 10.4.1 to the flow properties of matrix multiplication, giving a stronger result: that the space and time for matrix multiplication must satisfy the inequality $ST^2 = \Omega(n^6)$.

10.5  Applications  of  Grigoriev’s  Method 

Given the above results, to derive a lower bound on $\lceil \alpha(S+1) \rceil T$ using Corollary 10.4.1 it suffices to establish the independence property of a function. We apply this idea in this section to convolution, cyclic shifting, integer multiplication, matrix-vector multiplication, matrix inversion, and solving linear equations. We apply related arguments to derive lower bounds for the discrete Fourier transform and merging. Finally, we apply Theorem 10.4.1 to derive a lower bound on space-time exchanges for matrix-matrix multiplication that improves upon the bound of Section 10.4.3. Where possible we also derive upper bounds on space-time tradeoffs.
10.5.1 Convolution

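The wrapped convolution on strings of length $n$ over the ring $\mathcal{R}$, $f^{(n)}_{wrapped} : \mathcal{R}^{2n} \mapsto \mathcal{R}^n$, is defined in Problem 6.19. It can be characterized by the following product of a circulant matrix with a vector (see Section 6.2):

$$
\begin{bmatrix} w_0 \\ w_1 \\ \vdots \\ w_{n-2} \\ w_{n-1} \end{bmatrix}
=
\begin{bmatrix}
u_0 & u_{n-1} & u_{n-2} & \cdots & u_1 \\
u_1 & u_0 & u_{n-1} & \cdots & u_2 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
u_{n-2} & u_{n-3} & u_{n-4} & \cdots & u_{n-1} \\
u_{n-1} & u_{n-2} & u_{n-3} & \cdots & u_0
\end{bmatrix}
\times
\begin{bmatrix} v_0 \\ v_1 \\ \vdots \\ v_{n-2} \\ v_{n-1} \end{bmatrix}
$$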

Lemma 10.5.1 demonstrates $(2, 2n, n, n/2)$-independence for the wrapped convolution function $f^{(n)}_{wrapped} : \mathcal{R}^{2n} \mapsto \mathcal{R}^n$ by showing that for any set $X_0$ of inputs there is a way to put $|Y_1|/2$ of the inputs in $X - X_0$ into a one-to-one correspondence with $|Y_1|/2$ entries in any set $Y_1$ of outputs. This is established by setting one component of $v$ to 1 and the rest to 0.
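The role of the unit vector is easy to see in code. The sketch below is illustrative; `wrapped_convolution` is a hypothetical helper that evaluates the circulant product directly, with entry $(i, j)$ of the matrix taken to be $u_{(i-j) \bmod n}$ as in the display above.

```python
def wrapped_convolution(u, v):
    """w = M v, where M is the circulant matrix with M[i][j] = u[(i - j) mod n]."""
    n = len(u)
    return [sum(u[(i - j) % n] * v[j] for j in range(n)) for i in range(n)]

u = [10, 20, 30, 40]

# Setting component t of v to 1 and the rest to 0 selects column t of M,
# so every output w_i is a copy of a single input u_{(i - t) mod n}.
t = 1
unit = [1 if j == t else 0 for j in range(len(u))]
print(wrapped_convolution(u, unit))

# A general product, for comparison.
print(wrapped_convolution([1, 2, 3], [4, 5, 6]))
```

With $v$ a unit vector the outputs are a cyclic shift of $u$, which is why free inputs among the $u_i$ can be placed in one-to-one correspondence with outputs.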


LEMMA 10.5.1 For $n$ even, the wrapped convolution $f^{(n)}_{wrapped} : \mathcal{R}^{2n} \mapsto \mathcal{R}^n$ over the ring $\mathcal{R}$ is $(2, 2n, n, n/2)$-independent.

Proof Consider subsets $X_0$ and $Y_1$ of the inputs $X$ and outputs $Y$ of $f^{(n)}_{wrapped}$ satisfying $|X_0| + |Y_1| = p = n/2$. For $f^{(n)}_{wrapped}$ to be $(2, 2n, n, n/2)$-independent, there must be an assignment to input variables in $X_0$ such that the output variables in $Y_1$ have more than $|\mathcal{R}|^{(|Y_1|/2)-1}$ distinct values as the input variables of $f^{(n)}_{wrapped}$ in $X_1 = X - X_0$ range over all possible values.

As shown above, $f^{(n)}_{wrapped}$ is defined by a matrix-vector product $w = Mv$, $M$ a circulant matrix, in which each row (column) is a cyclic shift of the first row (column). Let $e = |X_0 \cap \{u_0, u_1, \ldots, u_{n-1}\}|$. Thus, every row of $M$ contains the same number $e$ of entries from $X_0$. Also, $n - e$ of the inputs $u_i$ are in $X_1 = X - X_0$. The entries in $X_1$ are free to vary.

Each output in $Y_1$ corresponds to a row of $M$. The number of instances of input variables from $X_1$ in these rows is $|Y_1|(n - e)$. Since these rows have $n$ columns, there is some column, say the $t$th, containing at least the average number of instances from $X_1$. This average is $|Y_1|(1 - e/n) \ge |Y_1|/2$. (The instances of variables from $X_1$ in a column are distinct.) It follows that by choosing the $t$th component of $v$, $v_t$, to be 1 and the others to be 0, at least $|Y_1|/2$ of the inputs in $X_1$ are mapped onto outputs in $Y_1$. Since these inputs (and outputs) can assume $|\mathcal{R}|^{|Y_1|/2}$ different values, it follows that $f^{(n)}_{wrapped}$ is $(2, 2n, n, n/2)$-independent. ■


This implies the lower bound stated below. The upper bound follows from the standard matrix-vector algorithm for the wrapped convolution using the observation that an inner product can be done with three pebbles, as suggested in Fig. 10.6.

THEOREM 10.5.1 The time $T$ and space $S$ required to pebble any straight-line program for the standard or wrapped convolution must satisfy the following inequality:

$$(S+1)T \ge n^2/16$$

This lower bound can be achieved to within a constant multiplicative factor for $S = O(1)$.
10.5.2 Cyclic Shifting

The cyclic shifting function $f^{(n)}_{cyclic} : \mathcal{B}^{n+\lceil \log n \rceil} \mapsto \mathcal{B}^n$ defined in Section 2.5.2 is a subfunction of many functions, including integer multiplication and squaring (see Section 2.9.5), integer reciprocal (see Section 2.10.1), and powers of integers (see Problems 2.34 and 2.35).

Cyclic  shifting  is  another  good  example  of  a  problem  for  which  a  lower  bound  on  the 
exchange  of  space  and  time  exists.  The  method  used  to  establish  the  independence  properties 
of  this  function  can  be  generalized  to  the  class  of  transitive  functions.  (See  Problem  10.22.) 

We redefine $f^{(n)}_{cyclic}$ here. Let $k = \lceil \log n \rceil$. The input variables of $f^{(n)}_{cyclic}$ are segmented into two groups, an $n$-tuple $x = (x_{n-1}, \ldots, x_1, x_0)$ of value variables and a $k$-tuple $s = (s_{k-1}, \ldots, s_1, s_0)$ of control variables. The control variables specify the integer $|s|$:

$$|s| = s_{k-1}2^{k-1} + \cdots + s_1 2^1 + s_0$$

$|s|$ is the number of places by which the value inputs must be shifted left cyclically to produce the output $n$-tuple $y = (y_{n-1}, \ldots, y_1, y_0)$. That is, $f^{(n)}_{cyclic}(x, s) = (y)$, where

$$y_j = x_{(j - |s|) \bmod n} \quad \text{for } 0 \le j \le n - 1 \qquad (10.2)$$

A circuit to implement $f^{(n)}_{cyclic}$ is given in Section 2.5.2 that cyclically shifts $x$ left by $2^j$ places for each of those values of $j$, $0 \le j \le \lceil \log n \rceil - 1$, such that $s_j = 1$.
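Equation (10.2) and the staged circuit description compute the same function, which can be checked exhaustively for a small $n$. The sketch below is illustrative only: both function names are hypothetical, and Python list index $j$ holds $x_j$, so lists read low index first whereas the tuples in the text are written high index first.

```python
from itertools import product

def cyclic_shift(x, s):
    """Direct evaluation of (10.2): y_j = x_{(j - |s|) mod n}."""
    n = len(x)
    shift = sum(bit << j for j, bit in enumerate(s))   # the integer |s|
    return [x[(j - shift) % n] for j in range(n)]

def cyclic_shift_staged(x, s):
    """Shift by 2^j places for each control bit s_j equal to 1."""
    n = len(x)
    y = list(x)
    for j, bit in enumerate(s):
        if bit:
            step = (1 << j) % n
            y = [y[(i - step) % n] for i in range(n)]
    return y

x = [0, 1, 2, 3, 4, 5]               # n = 6, so k = 3 control bits
print(cyclic_shift(x, [1, 0, 0]))    # |s| = 1
agree = all(cyclic_shift(x, list(s)) == cyclic_shift_staged(x, list(s))
            for s in product([0, 1], repeat=3))
print(agree)
```

The staged version composes shifts by powers of two, so the total shift is $|s| \bmod n$, matching the direct definition even when $|s| \ge n$.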

The  independence  properties  of  the  cyclic  function  are  shown  by  demonstrating  that  some 
permutation  of  the  input  vector  x  aligns  unselected  inputs  with  selected  outputs. 

LEMMA 10.5.2 $f^{(n)}_{cyclic} : \mathcal{B}^{n+\lceil \log n \rceil} \mapsto \mathcal{B}^n$ is $(2, n + \lceil \log n \rceil, n, n/2)$-independent.

Proof Consider subsets $X_0$ and $Y_1$ of the inputs $X$ and outputs $Y$ of $f^{(n)}_{cyclic}$ satisfying $|X_0| + |Y_1| = p = n/2$. For $f^{(n)}_{cyclic}$ to be $(2, n + \lceil \log n \rceil, n, n/2)$-independent, there must be an assignment to input variables in $X_0$ such that the output variables in $Y_1$ have more than $|\mathcal{B}|^{(|Y_1|/2)-1}$ distinct values as the input variables of $f^{(n)}_{cyclic}$ in $X_1 = X - X_0$ range over all possible values.

Let $X_0$ contain $e$ elements from $x$. Let $y_i \in Y_1$. As $s$ runs through all possible shift values, $y_i$ is made equal to every one of the inputs in $x$. For $n - e$ of these shifts $y_i$ is set equal to an input in $X_1 = X - X_0$. (For example, if $n = 6$ and $e = 2$, say with $X_1$ containing four of the six value inputs and $Y_1 = \{y_2, y_3, y_5\}$, then as $s$ ranges over all of its values, each of the three $y_i$ in $Y_1$ is assigned four different variables in $X_1$.) Thus, the number of input variables in $X_1$ assigned to outputs in $Y_1$, summed over all cyclic shifts, is $|Y_1|(n - e)$. Since there are $n$ cyclic shifts, for some shift the number of variables in $X_1$ that are matched with outputs in $Y_1$ is at least the average of this quantity; that is, at least $|Y_1|(1 - e/n) \ge |Y_1|/2$. Thus, some shift sets at least $|Y_1|/2$ inputs in $X_1$ to outputs in $Y_1$. Since these outputs can assume $|\mathcal{B}|^{|Y_1|/2}$ different values, it follows that $f^{(n)}_{cyclic}$ is $(2, n + \lceil \log n \rceil, n, n/2)$-independent. ■
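The averaging step in this proof can be traced on a concrete instance. The sketch below is illustrative: the function name `matches_per_shift` and the particular choice of fixed inputs are assumptions made here, using the same sizes as the parenthetical example ($n = 6$, $e = 2$, $|Y_1| = 3$).

```python
def matches_per_shift(n, fixed_values, selected_outputs):
    """For each cyclic shift, count the outputs y_j in Y1 that receive a value
    input x_{(j - shift) mod n} lying outside X0 (that is, in X1)."""
    return [sum(1 for j in selected_outputs if (j - shift) % n not in fixed_values)
            for shift in range(n)]

n = 6
fixed_values = {1, 2}            # indices of the e = 2 value inputs placed in X0
selected_outputs = {2, 3, 5}     # Y1 = {y_2, y_3, y_5}

counts = matches_per_shift(n, fixed_values, selected_outputs)
print(counts)                    # one count per shift 0, 1, ..., n - 1
print(sum(counts))               # equals |Y1| * (n - e)
print(max(counts))               # at least the average |Y1| * (1 - e/n)
```

The counts sum to $|Y_1|(n - e) = 12$ over the six shifts, so some shift matches at least two free inputs with selected outputs; here one shift matches all three.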

THEOREM 10.5.2 Every pebbling strategy for straight-line programs computing the cyclic shifting function $f^{(n)}_{cyclic} : \mathcal{B}^{n+\lceil \log n \rceil} \mapsto \mathcal{B}^n$ requires space $S$ and time $T$ satisfying the inequality

$$(S+1)T \ge n^2/16$$

An algorithm exists to compute $f^{(n)}_{cyclic}$ that uses space $O(n)$ and time $O(n \log n)$, namely, that satisfies the inequality

$$(S+1)T = O(n^2 \log n)$$


Proof  We  leave  the  upper-bound  proof  to  the  reader.  (See  Problem  10.30.)  ■ 


We  now  apply  this  result  to  integer  multiplication. 


10.5.3  Integer  Multiplication 

To apply Grigoriev’s method to the binary integer multiplication function $f^{(n)}_{mult} : \mathcal{B}^{2n} \mapsto \mathcal{B}^{2n}$ of Section 2.9, we assemble a collection of results to show that with the proper encoding of one of its two arguments, $f^{(n)}_{mult}$ computes the logical shifting function $f^{(n)}_{shift}$ (see Lemma 2.9.1) and when $n$ is even the logical shifting function $f^{(n)}_{shift}$ contains the cyclic shift function $f^{(n/2)}_{cyclic}$ as a subfunction (see Lemma 2.5.2). Thus, $f^{(n)}_{mult}$ contains $f^{(n/2)}_{cyclic}$ as a subfunction. We use this fact to obtain a lower bound on the space-time product for integer multiplication.

THEOREM 10.5.3 Let $n$ be even. Every pebbling strategy for straight-line programs computing the binary integer multiplication function $f^{(n)}_{mult} : \mathcal{B}^{2n} \mapsto \mathcal{B}^{2n}$ requires space $S$ and time $T$ satisfying the following inequality:

$$(S+1)T \ge n^2/64$$

An algorithm exists for multiplying $n$-bit integers using space $O(\log^2 n)$ and time $O(n^2)$, namely, that satisfies

$$(S+1)T = O(n^2 \log^2 n)$$

Proof The lower-bound argument is given above. The upper bound follows from a pebbling of an integer multiplication circuit to multiply $n$-bit binary integers $u$ and $v$. The circuit is based on the following standard expansion of their product (shown for $n = 4$):

$$
\begin{array}{ccccccc}
 & & & v_3u_0 & v_2u_0 & v_1u_0 & v_0u_0 \\
 & & v_3u_1 & v_2u_1 & v_1u_1 & v_0u_1 & 0 \\
 & v_3u_2 & v_2u_2 & v_1u_2 & v_0u_2 & 0 & 0 \\
v_3u_3 & v_2u_3 & v_1u_3 & v_0u_3 & 0 & 0 & 0
\end{array}
$$
To construct a circuit we use the observation that the number of 1s in the $j$th column is the $j$th component, $w_j$, of the convolution $w = u \otimes v$. (See Section 6.7.4.)

To compute $w_j$ we use the counting circuit $f^{(n)}_{count} : \mathcal{B}^n \mapsto \mathcal{B}^{\lceil \log n \rceil}$ of Section 2.11 on $n$ inputs to count the number of 1s among the products $u_r v_s$ of the Boolean variables $u_r$ and $v_s$ in the sum

$$w_j = \sum_{r+s=j} u_r * v_s \quad \text{for } 0 \le j \le 2n - 2$$

To compute the $2n$-bit product we add the binary representations for $w_0, w_1, \ldots, w_{2n-2}$ in a set of $(2n - 1)$ ripple adders, adding $w_j$ to the sum $\sigma(j) = \sum_{0 \le i \le j-1} w_i 2^i$ as suggested in Fig. 10.7, where we omit the counting circuits used to compute the values of $w_0, \ldots, w_{2n-2}$.

Figure 10.7 A multiplication circuit that can be pebbled in $O(n^2)$ time and $O(\log^2 n)$ space. The counting circuits that generate $w_0, w_1, \ldots, w_{2n-2}$ are not shown.


Each counting function can be pebbled with $O(n)$ steps using $O(\log^2 n)$ pebbles without repebbling vertices. (See Problem 10.10.) After a counting circuit is pebbled, pebbles remain on its outputs until their values have been used elsewhere in the multiplication circuit.

The value of $w_j$ is represented by a $k$-tuple, $k \le \lceil \log_2 n \rceil$. The value of $\sigma(j)$ is represented by at most $\lceil \log_2(n(2^j - 1)) \rceil \le j + \lceil \log_2 n \rceil$ bits since it is the sum of at most $n$ $j$-bit binary numbers. Because $w_j$ is added after the first $j$ bits, the pebbles on these bits can be discarded. Only $\lceil \log_2 n \rceil$ bits of the running sum and a like number for $w_j$ are needed to hold values on the inputs to the ripple adder. A fixed additional number of pebbles suffices to pebble the internal vertices of the adder. On completion of the sum only $\lceil \log_2 n \rceil$ pebbles are needed. They are used to hold the portion of the running sum that is used in the next stage of addition.

For each value of $j$, $0 \le j \le 2(n - 1)$, $O(\log n)$ steps are executed in the ripple adder and $O(n)$ steps are executed in a counting circuit. Consequently, $O(\log^2 n)$ pebbles and $O(n^2)$ time suffice to compute the product of $n$-bit binary numbers. ■
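The arithmetic identity behind this circuit, that the product equals $\sum_j w_j 2^j$ with $w_j$ the number of 1s in column $j$, can be checked directly. The sketch below illustrates only that identity and does not model the pebbling or the ripple adders; the helper names are hypothetical and bit lists are little-endian (index $r$ holds $u_r$).

```python
def to_bits(value, n):
    """Little-endian list of the n low-order bits of value."""
    return [(value >> i) & 1 for i in range(n)]

def column_counts(u_bits, v_bits):
    """w_j = number of 1s among the products u_r * v_s with r + s = j."""
    n = len(u_bits)
    return [sum(u_bits[r] & v_bits[j - r]
                for r in range(max(0, j - n + 1), min(j, n - 1) + 1))
            for j in range(2 * n - 1)]

def multiply(u_bits, v_bits):
    """Accumulate the column counts with their weights 2^j."""
    total = 0
    for j, w_j in enumerate(column_counts(u_bits, v_bits)):
        total += w_j << j
    return total

u_bits, v_bits = to_bits(13, 4), to_bits(11, 4)
print(column_counts(u_bits, v_bits))
print(multiply(u_bits, v_bits))
```

Each $w_j$ is at most $n$, which is why a counting circuit with about $\log_2 n$ output bits suffices for every column.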

In Section 10.13.2 we show that a lower bound of $\Omega(n^2/\log^2 n)$ applies under the branching program model. The stronger lower bound of $\Omega(n^2)$ derived here reflects the extra constraints imposed on the pebble game, namely that inputs are read and computations performed at data-independent times.

Similar results apply to the squaring function $f_{square}$ since, as shown in Lemma 2.9.2, $f_{square}$ contains $f_{mult}$ as a subfunction. (See Problem 10.32.)

Similar results also apply to the reciprocal function $f^{(n)}_{recip} : \mathcal{B}^n \mapsto \mathcal{B}^n$ since, as shown in Lemma 2.10.1, $f^{(n)}_{recip}$ contains as a subfunction the squaring function $f^{(m)}_{square}$ for $m = \lfloor n/12 \rfloor - 1$. (See Problem 10.33.)


10.5.4  Matrix  Multiplication 

In this section we show that the matrix multiplication function is richer than the other functions examined above in that it exhibits a stronger space-time lower bound than given in Theorem 10.4.2. After we derive a lower bound on the function $w(u,v)$ we specialize Theorem 10.4.1 to this case, thereby deriving the stronger lower bound.
LEMMA 10.5.3 The matrix multiplication function $f^{(n)}_{A\times B} : \mathcal{R}^{2n^2} \mapsto \mathcal{R}^{n^2}$ over the ring $\mathcal{R}$ has a $w(u,v)$-flow, where $w(u,v)$ satisfies the following lower bound:

$$w(u,v) \ge (v - (2n^2 - u)^2/4n^2)/2$$

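Proof Let $C = AB$ be the product of $n \times n$ matrices $A$ and $B$. We establish this result by using characteristic functions to identify the outputs in $C$ in $Y_1$ and the inputs in $A$ and $B$ in $X_1$, as indicated below. Here the indices $i$ and $j$ range over $0 \le i, j \le n - 1$:

$$
\bar{c}_{i,j} = \begin{cases} 1 & c_{i,j} \in Y_1 \\ 0 & \text{otherwise} \end{cases}
\qquad
\bar{a}_{i,j} = \begin{cases} 1 & a_{i,j} \in X_1 \\ 0 & \text{otherwise} \end{cases}
\qquad
\bar{b}_{i,j} = \begin{cases} 1 & b_{i,j} \in X_1 \\ 0 & \text{otherwise} \end{cases}
$$

Let $\bar{A}$, $\bar{B}$, and $\bar{C}$ denote the matrices $[\bar{a}_{i,j}]$, $[\bar{b}_{i,j}]$, and $[\bar{c}_{i,j}]$, respectively. Denote by $|\bar{A}|$, $|\bar{B}|$, and $|\bar{C}|$ the number of 1s in the three corresponding matrices. Note that $|\bar{A}| + |\bar{B}| = |X_1|$ and $|\bar{C}| = |Y_1|$.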

The $k$th $n \times n$ cyclic permutation matrix $P(k)$ is the $n \times n$ identity matrix in which the rows are rotated cyclically $k - 1$ times. For example, the following $3 \times 3$ matrix is $P(3)$.

$$
\begin{bmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{bmatrix}
$$

Let $D$ be an $n \times n$ matrix. The matrix $P(k)D$ consists of the rows of $D$ shifted cyclically down $k - 1$ places. Similarly, the matrix $DP(k)$ consists of the columns of $D$ shifted cyclically left $k - 1$ places.

Let $\bar{B}(k)$ be the matrix obtained from $\bar{B}$ by multiplication on the left by $A = P(k)$. Similarly, let $\bar{A}(k)$ be the matrix obtained from $\bar{A}$ by multiplication on the right by $B = P(k)$. Then, a 1 value for the $(i,j)$ entry in $\bar{A}(k)$ and $\bar{B}(k)$ identifies a variable in $X_1$ that is mapped to an output variable of $C$ through its multiplication by $P(k)$.

Let $D$ and $E$ be $n \times n$ matrices whose entries are drawn from the set $\{0, 1\}$. We denote by $D \cap E$ the $n \times n$ matrix whose $(i,j)$ entry is 1 if $d_{i,j} = e_{i,j} = 1$. Similarly, let $D \cup E$ be the $n \times n$ matrix whose $(i,j)$ entry is 1 if either $d_{i,j} = 1$ or $e_{i,j} = 1$. The following identity applies:

$$|D \cup E| + |D \cap E| = |D| + |E| \qquad (10.3)$$

Since $|D \cup E| \le n^2$ for $n \times n$ matrices, the following inequality holds:

$$|D \cap E| \ge |D| + |E| - n^2 \qquad (10.4)$$

Also, since $|D \cap E| \ge 0$ we have

$$|D| + |E| \ge |D \cup E| \qquad (10.5)$$

The $w(u,v)$-flow of matrix multiplication is large if for some choice of $r$ or $s$, $|\bar{C} \cap \bar{A}(r)|$ or $|\bar{C} \cap \bar{B}(s)|$ is large. This follows because choosing $B$ to be the $r$th cyclic permutation makes many variables of $A$ in $X_1$ match entries in $C$ in $Y_1$, or choosing $A$ to be the $s$th cyclic permutation makes many variables of $B$ in $X_1$ match entries in $C$ in $Y_1$. When an input and output variable match, the latter assumes the value of the former. Thus, all the variation in the former is reflected in the latter.

Let $Q = |\bar{C} \cap \bar{A}(r)| + |\bar{C} \cap \bar{B}(s)|$. Then the $w(u,v)$-flow is at least $Q/2$. Applying (10.5) and then (10.4) to $Q$, we have the following inequalities:

$$Q \ge |\bar{C} \cap (\bar{A}(r) \cup \bar{B}(s))| \ge |\bar{C}| + |\bar{A}(r) \cup \bar{B}(s)| - n^2$$

Applying (10.3) to $|\bar{A}(r) \cup \bar{B}(s)|$ yields the following lower bound on $Q$:

$$Q \ge |\bar{C}| + |\bar{A}(r)| + |\bar{B}(s)| - |\bar{A}(r) \cap \bar{B}(s)| - n^2 \qquad (10.6)$$

But $|\bar{C}| = |Y_1|$, $|\bar{A}(r)| = |\bar{A}|$, $|\bar{B}(s)| = |\bar{B}|$, and $|\bar{A}| + |\bar{B}| = |X_1|$. We now show that there are values for $r$ and $s$ such that $|\bar{A}(r) \cap \bar{B}(s)|$ is at most $|\bar{A}||\bar{B}|/n^2$.

Consider the following sum (within this proof, $S$ denotes this sum rather than space):

$$S = \sum_{r=1}^{n} \sum_{s=1}^{n} |\bar{A}(r) \cap \bar{B}(s)|$$

Since $\bar{A}(r)$ and $\bar{B}(s)$ are formed by the $r$th and $s$th cyclic shift of columns of $\bar{A}$ and rows of $\bar{B}$ respectively, each 1 in $\bar{A}$ is aligned once with each 1 in $\bar{B}$. It follows that

$$S = |\bar{A}||\bar{B}|$$

As a consequence, there are some $r$ and $s$ such that $|\bar{A}(r) \cap \bar{B}(s)|$ is at most $S/n^2$. Applying this result in (10.6), we have the following lower bound on $Q$:


$$Q \ge |Y_1| + |\bar{A}| + |\bar{B}| - |\bar{A}||\bar{B}|/n^2 - n^2$$

Since $|X_1| = |\bar{A}| + |\bar{B}|$ is fixed, the above lower bound on $Q$ is minimized by maximizing $|\bar{A}||\bar{B}|$ under variation of $|\bar{A}|$. This maximum occurs when $|\bar{A}| = |X_1|/2$. Consequently we have the following lower bound on $Q$:

$$Q \ge |Y_1| - n^2\left(1 - \frac{|X_1|}{2n^2}\right)^2$$

Since $w(u,v) \ge Q/2$ for $u = |X_1|$ and $v = |Y_1|$, we have the desired conclusion. ■
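The counting identity used in this proof, that the overlaps summed over all pairs of cyclic shifts equal the product of the two counts of 1s, can be confirmed numerically. The sketch below is illustrative: the helper names and the two sample 0/1 matrices are assumptions, with columns shifted left and rows shifted down by $k - 1$ places as described for $P(k)$.

```python
def shift_columns_left(m, t):
    """Columns of m shifted cyclically left t places (the effect of m P(t + 1))."""
    n = len(m)
    return [[m[i][(j + t) % n] for j in range(n)] for i in range(n)]

def shift_rows_down(m, t):
    """Rows of m shifted cyclically down t places (the effect of P(t + 1) m)."""
    n = len(m)
    return [m[(i - t) % n][:] for i in range(n)]

def overlap(d, e):
    """Number of positions where both 0/1 matrices hold a 1."""
    return sum(x & y for row_d, row_e in zip(d, e) for x, y in zip(row_d, row_e))

def overlap_table(a_bar, b_bar):
    """Entry [r-1][s-1] is the overlap of the r-th column shift of a_bar
    with the s-th row shift of b_bar."""
    n = len(a_bar)
    return [[overlap(shift_columns_left(a_bar, r - 1), shift_rows_down(b_bar, s - 1))
             for s in range(1, n + 1)] for r in range(1, n + 1)]

a_bar = [[1, 0, 1], [0, 1, 0], [0, 0, 1]]   # four entries of A in X1
b_bar = [[1, 1, 0], [0, 0, 0], [1, 0, 1]]   # four entries of B in X1
table = overlap_table(a_bar, b_bar)
total = sum(map(sum, table))
print(total)                                # equals 4 * 4
print(min(min(row) for row in table))       # at most total / n^2
```

Because the nine overlaps sum to 16, at least one pair of shifts has overlap no larger than $16/9$, which is the averaging step that bounds the intersection term in (10.6).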

We now apply this result and Theorem 10.4.1 to derive a stronger result for matrix multiplication than was obtained earlier using its $(1, 2n^2, n^2, n)$-independence property.

THEOREM 10.5.4 Every pebbling strategy for straight-line programs computing the matrix multiplication function $f^{(n)}_{A\times B} : \mathcal{R}^{2n^2} \mapsto \mathcal{R}^{n^2}$ for $n \times n$ matrices requires space $S$ and time $T$ satisfying the following inequality:

$$ST^2 \ge n^6/3$$

The standard algorithm for multiplying $n \times n$ matrices uses space and time satisfying

$$ST^2 = 48n^6$$
Proof From Lemma 10.5.3 we have that the matrix multiplication function has a $w(u,v)$-flow, where

$$w(u,v) \ge (v - (2n^2 - u)^2/4n^2)/2$$

Applying Theorem 10.4.1 to this problem with $b = 3S$, we seek the largest integer $d$ such that $w(d,b) \le S$, which must satisfy the bound

$$(3S - (2n^2 - d)^2/4n^2)/2 \le S$$

This implies that $(2n^2 - d) \ge 2n\sqrt{S}$. From Theorem 10.4.1, the time to pebble the graph satisfies

$$T \ge 2\sqrt{S}\,n\lfloor n^2/3S \rfloor \ge 2\sqrt{S}\,n(n^2 - 3S + 1)/3S$$

If $S \le n^2/27$, $T \ge (16n^3)/(27\sqrt{S})$ or $ST^2 \ge (.35)n^6$. On the other hand, since $T \ge 3n^2$ just to pebble inputs and outputs, if $S > n^2/27$, then $ST^2 \ge n^6/3$. ■

10.5.5  Discrete  Fourier  Transform 

The  discrete  Fourier  transform  (DFT)  is  defined  in  Section  6.7.3.  We  derive  upper  and  lower 
bounds  on  the  space-time  product  needed  to  compute  this  function. 

LEMMA 10.5.4 The $n$-point DFT function $F_n : \mathcal{R}^n \mapsto \mathcal{R}^n$ over a commutative ring $\mathcal{R}$ is
$(2, n, n, n/2)$-independent for $n$ even.

Proof As shown in equation (6.23), the DFT is defined by the matrix-vector product
$[w^{ij}]\boldsymbol{a}$, where $[w^{ij}]$ is a Vandermonde matrix. To show that the DFT function is
$(2, n, n, n/2)$-independent, consider any set $Y_1$ of outputs (corresponding to rows of $[w^{ij}]$)
and any set $X_0$ of inputs (corresponding to columns) whose values are to be fixed judiciously,
where $p = |X_0| + |Y_1| \le n/2$. We show that the outputs in $Y_1$ have at least $|\mathcal{R}|^{|Y_1|/2}$ values
as we vary over the remaining inputs.

It is straightforward to show that the submatrix of $[w^{ij}]$ defined by any $|Y_1|$ rows and any
$|Y_1|$ consecutive columns is non-singular. (Its determinant is that of another Vandermonde
matrix. Show this by letting the row and column indices be $r_1, r_2, \ldots, r_{|Y_1|}$ and
$s, s+1, \ldots, s + |Y_1| - 1$, respectively, and demonstrating that $w^{r_i s}$ can be factored out of the
$i$th row when computing its determinant.) Our goal is to show that some consecutive group of
columns corresponds to at least $|Y_1|/2$ inputs of $\boldsymbol{a}$ in $X_1$.
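The non-singularity claim can be checked exhaustively on a small instance. The sketch below is an illustration under stated assumptions, not part of the proof: it replaces the general commutative ring by the field of integers modulo 17, uses $n = 8$ and $w = 2$ (which has multiplicative order 8 modulo 17), and considers only column windows that do not wrap around. The helper names are hypothetical.

```python
from itertools import combinations

P = 17  # prime modulus; the field Z_17 stands in for the ring (assumption)
N = 8   # transform length
W = 2   # 2 has multiplicative order 8 modulo 17

def det_mod_p(matrix, p=P):
    """Determinant over Z_p computed by Gaussian elimination."""
    m = [row[:] for row in matrix]
    size, det = len(m), 1
    for col in range(size):
        pivot = next((r for r in range(col, size) if m[r][col] % p), None)
        if pivot is None:
            return 0  # no nonzero pivot in this column: singular
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det  # a row swap flips the sign
        det = det * m[col][col] % p
        inv = pow(m[col][col], -1, p)  # modular inverse (Python 3.8+)
        for r in range(col + 1, size):
            factor = m[r][col] * inv % p
            for c in range(col, size):
                m[r][c] = (m[r][c] - factor * m[col][c]) % p
    return det % p

def all_windows_nonsingular(n=N, w=W, p=P):
    """Check every choice of rows against every window of consecutive columns."""
    for size in range(1, n + 1):
        for rows in combinations(range(n), size):
            for start in range(n - size + 1):
                sub = [[pow(w, r * (start + j), p) for j in range(size)]
                       for r in rows]
                if det_mod_p(sub, p) == 0:
                    return False
    return True
```

Calling `all_windows_nonsingular()` examines every row subset against every admissible window of consecutive columns of the $8 \times 8$ matrix $[w^{ij}]$.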

Divide the $n$ columns of $[w^{ij}]$ into $\lceil n/|Y_1| \rceil$ groups of consecutive columns with $|Y_1|$
inputs in each group except possibly the last, which may have fewer. There are $n - |X_0|$
inputs that may vary. Since there are $\lceil n/|Y_1| \rceil$ groups, by an averaging argument some group
contains at least $(n - |X_0|)/\lceil n/|Y_1| \rceil$ of these inputs. Since $\lceil n/|Y_1| \rceil \le (n + |Y_1| - 1)/|Y_1|$,
we show that $(n - |X_0|)/\lceil n/|Y_1| \rceil \ge |Y_1|/2$ for $p \le n/2$. Observe that
$(n - |X_0|)/(n + |Y_1| - 1) \ge 1/2$ if $2n - 2|X_0| \ge n + |Y_1| - 1$ or $n \ge |X_0| + p - 1$, which holds because
$|X_0| \le p \le n/2$.

Since the submatrix defined by $k$ consecutive columns and any $k$ rows where
$\lceil |Y_1|/2 \rceil \le k \le |Y_1|$ is non-singular, it follows that any subset of $\lceil |Y_1|/2 \rceil$ columns has full rank. Thus,
the submatrix contains a non-singular $\lceil |Y_1|/2 \rceil \times \lceil |Y_1|/2 \rceil$ matrix. When all inputs outside
of these columns are set to zero, the $\lceil |Y_1|/2 \rceil$ outputs have $|\mathcal{R}|^{\lceil |Y_1|/2 \rceil}$ values, or $F_n$ is
$(2, n, n, n/2)$-independent. ■

The  space-time  lower  bound  stated  below  follows  from  Corollary  10.4.1. 

THEOREM 10.5.5 To pebble any straight-line program for the $n$-point DFT over a commutative
ring $\mathcal{R}$ requires space $S$ and time $T$ satisfying the following:

$$(S+1)T \ge n^2/16$$

when $n$ is even. The FFT graph on $n = 2^d$ inputs can be pebbled with space $S$ and time $T$
satisfying the upper bound

$$T \le 4n^2/(S - \log_2 n) + n\log_2 S$$

Thus, $(S+1)T = O(n^2)$ when $2\log_2 n \le S \le (n/\log_2 n) + \log_2 n$.

Proof This lower bound can be achieved up to a constant factor by a pebbling strategy
for the FFT algorithm, as we now show. Denote with $F^{(d)}$ the $n$-point FFT graph (it has
$n$ inputs), $n = 2^d$. (Figures 6.1, 6.7, and 10.8 show 4-point, 16-point, and 32-point
FFT graphs.) Inputs are at level 0 and outputs are at level $d$. We invoke Lemma 6.7.4
to decompose $F^{(d)}$ at level $d - e$ into a set of $2^{d-e}$ top $2^e$-point FFT graphs above the
split, $\{F^{(e)}_{t,j} \mid 1 \le j \le 2^{d-e}\}$, and a set of $2^e$ bottom $2^{d-e}$-point FFT graphs below the split,
$\{F^{(d-e)}_{b,j} \mid 1 \le j \le 2^e\}$, as suggested in Fig. 10.8. In this figure the vertices and edges have
been grouped together as recognizable FFT graphs and surrounded by shaded boxes. The
edges between boxes identify vertices that are common to pairs of FFT subgraphs.


Figure 10.8 Decomposition of the FFT graph $F^{(5)}$ into four copies of $F^{(3)}$ and eight copies
of $F^{(2)}$. Edges between bottom and top sub-FFT graphs are fictitious; they identify overlapping
vertices between sub-FFT graphs.

A good strategy for pebbling the vertices of an FFT graph is to pebble the top FFT
graphs $\{F^{(e)}_{t,j} \mid 1 \le j \le 2^{d-e}\}$ individually. The vertices of a top FFT graph in Fig. 10.8
are highlighted. To pebble its inputs, which are output vertices of FFT graphs below the
split, it suffices to pebble the subtrees rooted at these vertices. (They are also highlighted.)
Such subtrees are completely balanced binary trees with $2^{d-e}$ inputs. Thus, $d - e + 1$ pebbles
and $2^{d-e+1} - 1$ pebble placements suffice to place a pebble on the root of one such subtree.
If these subtrees are pebbled in sequence, pebbles can be left on the inputs to a $2^e$-point FFT
graph $F^{(e)}_{t,j}$ above the split using at most $2^e + d - e$ pebbles and $2^e(2^{d-e+1} - 1)$ pebble
placements. Since $2^e + 1$ pebbles and $e2^e$ pebble placements suffice to pebble $F^{(e)}_{t,j}$ level by
level without repebbling vertices, it follows that all instances of $F^{(e)}_{t,j}$ above the split can be
pebbled using a total of $T = 2^d(2^{d-e+1} + e - 1)$ pebble placements and $S = 2^e + d - e$
pebbles.
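The pebble and placement counts for one balanced subtree can be reproduced by simulation. The sketch below assumes the variant of the pebble game in which a pebble may be slid from a predecessor onto its successor, with a slide counted as one placement; the function name and the counters are illustrative.

```python
def pebble_balanced_tree(height):
    """Pebble the root of a complete binary tree with 2**height leaves.

    Returns (peak_pebbles, placements) for the depth-first strategy.
    """
    state = {"on_graph": 0, "peak": 0, "placements": 0}

    def visit(h):
        # Postcondition: exactly one additional pebble sits on this subtree's root.
        if h == 0:
            state["on_graph"] += 1  # fresh pebble on an input vertex
            state["peak"] = max(state["peak"], state["on_graph"])
            state["placements"] += 1
            return
        visit(h - 1)  # left child ends up pebbled
        visit(h - 1)  # right child ends up pebbled
        # Slide the right child's pebble onto the parent (one placement),
        # then remove the pebble from the left child.
        state["placements"] += 1
        state["on_graph"] -= 1

    visit(height)
    return state["peak"], state["placements"]
```

For a tree with $2^h$ leaves the simulation reports $h + 1$ pebbles and $2^{h+1} - 1$ placements, which matches the counts above with $h = d - e$.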

We now derive an upper bound on $T$ by deriving upper and lower bounds on the value
of $e$ satisfying $S = 2^e + d - e$. Because $S \ge 2^e$, we have $e \le \log_2 S$. Let $e_0$ be the smallest
integer such that $2^{e_0+1} + d \ge S$. Then, $2^{e_0} + d - e_0 \le S$ and $e \ge e_0$. Consequently,
$2^e \ge (S - d)/2$, from which we have

$$T = 2^d(2^{d-e+1} + e - 1) \le 4\,\frac{2^{2d}}{S - d} + 2^d \log_2 S$$

Finally, $\log_2 S \le 2^d/(S - d) \le 2 \cdot 2^d/S$ when $2d \le S \le (2^d/d) + d$, from which the
desired conclusion follows. ■
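The tradeoff can be tabulated directly from the two formulas $S = 2^e + d - e$ and $T = 2^d(2^{d-e+1} + e - 1)$. The following sketch evaluates them and compares $T$ with the stated upper bound; the function names are hypothetical and the comparison uses floating-point arithmetic.

```python
import math

def fft_pebbling_cost(d, e):
    """Space and time of the strategy that splits the 2**d-point FFT graph at level d - e."""
    if not 0 <= e <= d:
        raise ValueError("need 0 <= e <= d")
    space = 2 ** e + d - e
    time = 2 ** d * (2 ** (d - e + 1) + e - 1)
    return space, time

def stated_upper_bound(d, space):
    """Right-hand side 4 n^2 / (S - log2 n) + n log2 S with n = 2**d."""
    n = 2 ** d
    return 4 * n * n / (space - d) + n * math.log2(space)

# Example: the 32-point FFT graph (d = 5) split so that the top graphs are 8-point (e = 3).
example_space, example_time = fft_pebbling_cost(5, 3)
```

Varying `e` from 0 to `d` traces out the tradeoff: larger `e` spends more pebbles on the top FFT graphs and fewer placements on repebbling the subtrees below the split.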

10.5.6  Merging  Networks 

In this section we consider networks of comparators to merge two sorted lists. Such networks
were described in Section 6.8 and an example was given, Batcher's $(m,p)$ bitonic merging
network.

A comparator element computes the function $\otimes : A^2 \mapsto A^2$ that returns the maximum
and minimum of its two arguments, that is, $\otimes(a, b) = (\max(a, b), \min(a, b))$.
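As a concrete illustration of a comparator-based merging network, the sketch below merges two sorted lists with a recursive bitonic merge built only from the comparator defined above. It is a minimal version that assumes the combined length is a power of two; it is not claimed to reproduce the exact wiring of the network in Section 6.8, and the function names are illustrative.

```python
def comparator(a, b):
    """The comparator element: returns (max(a, b), min(a, b))."""
    return (max(a, b), min(a, b))

def bitonic_merge(seq):
    """Sort a bitonic sequence of length 2**k into ascending order using comparators only."""
    n = len(seq)
    if n == 1:
        return list(seq)
    half = n // 2
    low, high = [], []
    for i in range(half):
        big, small = comparator(seq[i], seq[i + half])
        low.append(small)   # the smaller values form a bitonic sequence
        high.append(big)    # the larger values form a bitonic sequence
    return bitonic_merge(low) + bitonic_merge(high)

def merge_sorted(x, y):
    """Merge two ascending lists by reversing y to obtain a bitonic input sequence."""
    total = len(x) + len(y)
    if total == 0 or total & (total - 1):
        raise ValueError("combined length must be a power of two")
    return bitonic_merge(list(x) + list(reversed(y)))

merged = merge_sorted([1, 4, 6, 9], [2, 3, 7, 8])
```

The comparisons made at each level do not depend on the data, which is what makes the computation a straight-line program to which the pebble game applies.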

LEMMA 10.5.5 Consider a comparator-based merging network that merges two sorted lists of $n$
distinct elements $\boldsymbol{x} = (x_1, x_2, \ldots, x_n)$ ($x_i \le x_{i+1}$) and $\boldsymbol{y} = (y_1, y_2, \ldots, y_n)$ ($y_i \le y_{i+1}$)
to produce the sorted list $\boldsymbol{z} = (z_1, z_2, \ldots, z_{2n})$ of $2n$ outputs ($z_i \le z_{i+1}$). There must be $r$
vertex-disjoint paths from any $r$ inputs in $\boldsymbol{x}$ to the outputs in $\boldsymbol{z}$ to which they are mapped by the
network.

Proof  Working  backwards  from  the  r  selected  outputs,  we  see  that  each  output  exits  from 
the  comparator  elements  to  which  it  is  attached  via  a  disjoint  path,  as  suggested  for  three 
outputs  in  Fig.  10.9.  Extending  this  argument  to  the  remainder  of  the  network  establishes 
the  result.  ■ 

We  next  show  that  inputs  can  be  given  values  to  cause  a  merging  network  to  shift  its  values 
in  a  fashion  that  permits  the  derivation  of  a  space-time  lower  bound. 

THEOREM 10.5.6 Any straight-line comparator-based program that merges two sorted lists of $n$
elements requires space $S$ and time $T$ satisfying

$$ST = \Omega(n^2)$$

This lower bound can be achieved to within a constant multiplicative factor when
$2\log_2 n \le S \le (n/\log_2 n) + \log_2 n$.

Figure 10.9 Movement of an ordered subset of the items through Batcher's bitonic merge
algorithm.

Proof Let $n$ be divisible by 2. Any consecutive $n/2$ inputs in $\boldsymbol{x}$ can be shifted to the middle
$n/2$ positions in $\boldsymbol{z}$ through a judicious choice of values for $\boldsymbol{y}$. To see this, observe that the
first $k = n - n/4 - l$ components of $\boldsymbol{y}$, $l \le n/2$, can be chosen to be less than the first $l$
components of $\boldsymbol{x}$ with the remaining $n - k$ components of $\boldsymbol{y}$ chosen to be larger than the
first $l + n/2$ components of $\boldsymbol{x}$. This will cause elements in positions $l + 1, l + 2, \ldots, l + n/2$
to shift into positions $n - n/4 + 1, \ldots, n + n/4$.
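The shifting argument can be checked on small cases. The sketch below builds one possible choice of $\boldsymbol{y}$ for given $n$ and $l$ and reports where the block $x_{l+1}, \ldots, x_{l+n/2}$ lands in the merged list. The specific values chosen for $\boldsymbol{x}$ and $\boldsymbol{y}$ are assumptions made for illustration, $n$ is taken to be divisible by 4 so that $n/4$ is an integer, and Python's `sorted` stands in for the merging network.

```python
def middle_positions(n, l):
    """Return the 1-based positions in z of x_{l+1}, ..., x_{l+n/2}."""
    if n % 4 or not 0 <= l <= n // 2:
        raise ValueError("need n divisible by 4 and 0 <= l <= n/2")
    step = n + 1  # gaps wide enough to fit the y values between x values
    x = [step * (i + 1) for i in range(n)]  # x_1 < x_2 < ... < x_n
    k = n - n // 4 - l
    # k values of y below every component of x ...
    small = [x[0] - k + j for j in range(k)]
    # ... and n - k values of y just above x_{l + n/2}.
    large = [x[l + n // 2 - 1] + 1 + j for j in range(n - k)]
    y = small + large
    z = sorted(x + y)  # stand-in for the merging network
    return [z.index(v) + 1 for v in x[l:l + n // 2]]
```

For every admissible `l` the reported positions are $n - n/4 + 1, \ldots, n + n/4$, so varying `l` realizes the cyclic shifts of the chosen block onto the same middle outputs.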

Since coalescing vertices in a graph reduces neither the time nor space needed to pebble
it, coalesce input vertices assigned to $\boldsymbol{x}$ whose indices are equivalent modulo $n/2$. By
Lemma 10.5.5, the new graph has $n/2$ vertex-disjoint paths between the new inputs and the
$n/2$ outputs in positions $l + 1, l + 2, \ldots, l + n/2$ for each of the $n/2$ cyclic permutations.
It follows that the argument applied to the cyclic shifting function (Lemma 10.5.2) applies
to this function. Thus, the merging network computes a function containing a subfunction
that is $(2, n/2, n/2, n/4)$-independent. The lower bound follows from Corollary 10.4.1.

As  shown  in  Section  6.8,  the  graph  of  Batcher’s  bitonic  merging  network  is  an  FFT 
graph.  Thus,  the  upper  bounds  given  in  Theorem  10.5.5  apply.  ■ 

10.6  Worst-Case  Tradeoffs  for  Pebble  Games* 

In this section we show that degree-$d$ graphs on $n$ vertices can be pebbled with $O(n/\log n)$
pebbles (Theorem 10.7.1) and that some graphs require this many (Theorem 10.8.1). These
results do not answer the question of how bad the space-time tradeoff can be for an arbitrary
graph. To address this question we must make it precise. Lengauer and Tarjan [197] state it
as follows: is there a value for the space $S$, say, $S_j(n)$, such that for positive constants $c_1(d)$
and $c_2(d)$ if $S \le c_1(d)S_j(n)$, some graph on $n$ vertices requires time superpolynomial in
$n$ to pebble it, whereas for $S \ge c_2(d)S_j(n)$ all graphs on $n$ vertices can be pebbled with a
polynomial number of steps? They show that there is such a jump value for space and that
$S_j(n) = \Theta(n/\log\log n)$. Since all graphs on $n$ vertices can be pebbled with $O(n/\log n)$
space, their result shows there exist graphs on $n$ vertices that require time exponential in $n$
when pebbled with this number of pebbles.

10.7  Upper  Bounds  on  Space* 

We establish upper bounds on space for the class $\mathcal{G}(n,d)$ of directed acyclic graphs on $n$
vertices that have maximum in-degree $d$ and out-degree 2. We limit the out-degree to 2
because many straight-line programs with fan-out $k > 2$ (and their associated DAGs) can
be reorganized so that each computation with fan-out $k$ can be replaced by a binary tree of
replicating subcomputations in which edges are directed from the root to the leaves. This at
most doubles the number of vertices in the graph. (See Problem 10.12.)

THEOREM 10.7.1 Let $\mathcal{G}(n,d)$ be the set of graphs with $n$ vertices, in-degree $d$, and out-degree 2 for $d$
fixed. Then $S_{\min}(n,d)$, the minimum space needed to pebble any DAG in $\mathcal{G}(n,d)$, satisfies
$S_{\min}(n,d) = O(n/\log n)$.

Proof Let $E_{\min}(p,d)$ be the minimum number of edges in any graph in $\mathcal{G}(n,d)$ that
requires $p$ pebbles in the pebble game. We show that $E_{\min}(p,d) \ge cp\log_2 p$ for some
constant $c > 0$. From this it follows that

$$p \le 2\,(E_{\min}(p,d)/c)\,/\,\log_2(E_{\min}(p,d)/c)$$

when $p \ge 2$ and $E_{\min}(p,d) \ge 2c$. (See Problem 10.3.)

Consider a graph $G = (V,E)$ in $\mathcal{G}(n,d)$ with $|E|$ edges. Counting each edge once at each of
its two endpoints gives $2|E|$ edge-vertex incidences. Since each vertex has at most $d + 2$ incident
edges, $2|E| \le (d + 2)|V| = (d + 2)n$. The upper bound on the number of pebbles, $p$, follows from
this fact and the previous discussion.

Let $G = (V,E)$ in $\mathcal{G}(n,d)$ require $p$ pebbles. An edge in $E$ is a pair of vertices $(u,v)$.
Let $V_1 \subseteq V$ be vertices that can be pebbled with $p/2$ or fewer pebbles. Let $V_2 = V - V_1$.
Thus, every vertex in $V_2$ requires more than $p/2$ pebbles. Let $E_i$, $i = 1, 2$, be the set of
edges both of whose endpoints are in $V_i$. Let $G_i = (V_i, E_i)$. Let $A = E - (E_1 \cup E_2)$; that
is, $A$ is the set of edges joining vertices in $V_1$ and $V_2$.

We now show that there exists a vertex in $G_2$ that requires more than $p/2 - d$ pebbles
if the pebble game is played on $G_2$ only. Suppose not. Then we show that every vertex in $G$
can be pebbled with fewer than $p$ pebbles. Certainly every vertex in $V_1$ can be pebbled with
fewer than $p$ pebbles. Consider vertices in $V_2$. We show they can be pebbled with fewer than
$p$ pebbles, thereby establishing a contradiction.

Let $v \in V_2$ be pebbled with $p/2 - d$ or fewer pebbles when $G_2$ alone is pebbled. In
pebbling $v$ as part of the full graph $G$, we may need to pebble a vertex $w \in V_2$ some of
whose immediate predecessors are in $V_1$. As we encounter such vertices $w$, advance a pebble
to each of $w$'s predecessors in $V_1$ one at a time until all predecessors of $w$ are pebbled. After
pebbling a predecessor in $V_1$, remove pebbles in $V_1$ not on such predecessors. When all
of $w$'s predecessors in $V_1$ have been pebbled, pebble $w$ itself using one of the $p/2 - d$ or
fewer pebbles reserved for pebbling on $V_2$. This strategy uses at most $p/2 + d - 1$ pebbles
on vertices in $V_1$, at most $d - 1$ for all but the last predecessor in $V_1$ and at most $p/2$
for the last such predecessor, and at most $p/2 - d$ pebbles on vertices in $V_2$, for a total of
at most $p - 1$. This is a contradiction. It follows that $G_2$ requires at least $p/2 - d + 1$
pebbles when pebbled alone and must have at least $E_{\min}(p/2 - d + 1, d)$ edges. Note that
$E_{\min}(p/2 - d + 1, d) \ge E_{\min}(p/2 - d, d)$.

There is also some vertex in $G_1$ that requires at least $p/2 - d$ pebbles, as we show. By
assumption every vertex in $V_1$ must be pebbled. Suppose that each can be pebbled with
$p/2 - d - 1$ pebbles. There must be a vertex $\eta$ in $V_2$ all of whose predecessors are in
$V_1$. (If not, we can always move backward from a vertex in $V_2$ to one of its immediate
predecessors in $V_2$, a process that must terminate since the finite acyclic graph does not have
a cycle.) Thus, the vertex $\eta$ can be pebbled with $p/2 - 1$ pebbles using the pebbling strategy
described in the preceding paragraph for $w$, contradicting the definition of $V_2$. It follows
that $G_1$ must have at least $E_{\min}(p/2 - d, d)$ edges.

Consider now the set of edges $A$ connecting vertices in $V_1$ and $V_2$. If $|A| \ge p/4$,
$E_{\min}(p,d) \ge 2E_{\min}(p/2 - d, d) + |A|$ because both $G_1$ and $G_2$ have at least $E_{\min}(p/2 - d, d)$
edges. If $|A| < p/4$, pebbles can be placed on the endpoints of edges of $A$ in $V_1$ using at
most $p/2 + p/4 - 1 < 3p/4$ pebbles, with the strategy for $w$ given above. If we leave at
most $p/4$ pebbles on these vertices, $3p/4$ pebbles are available to pebble the vertices in $V_2$.
If $V_2$ does not require at least $3p/4$ pebbles, we have a contradiction to the assumption that
$p$ pebbles are needed. Thus, there must be an output vertex $\mu$ that requires at least $3p/4$
pebbles, for if not, none of its predecessors can require more.

We show that a graph requiring at least $3p/4$ pebbles has a subgraph with at least $p/(4d)$
fewer edges that requires at least $p/2$ pebbles. To see this, observe that some predecessor of
the output vertex $\mu$ requires at least $3p/4 - d$ pebbles. Delete $\mu$ and all its incoming edges
to produce a subgraph with at least one fewer edge requiring at least $3p/4 - d$ pebbles.
Repeat this process $p/(4d)$ times to produce the desired result. It follows that $G_2$ has at least
$E_{\min}(p/2, d) + p/(4d)$ edges.

Thus, when either $|A| \ge p/4$ or $|A| < p/4$, at least $2E_{\min}(p/2 - d, d) + p/(4d)$ edges
are required, and

$$E_{\min}(p,d) \ge 2E_{\min}(p/2 - d, d) + p/(4d)$$

The solution to this recurrence is $E_{\min}(p,d) \ge cp\log p$ for some constant $c > 1/(8d)$ and a
sufficiently large value of $p$. ■
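The growth rate of the recurrence can be examined by unrolling it numerically. The sketch below treats the recurrence as an equality and stops with value 0 once $p \le 4d$; this base case is an assumption made only so that the recursion terminates and is not taken from the text. The function name is illustrative.

```python
import math

def edge_lower_bound(p, d):
    """Unroll L(p) = 2 L(p/2 - d) + p/(4d), with L(p) = 0 for p <= 4d (assumed base case)."""
    if p <= 4 * d:
        return 0.0
    sub = edge_lower_bound(p / 2 - d, d)  # evaluate the subproblem once
    return 2 * sub + p / (4 * d)

# Compare with (1/(8d)) p log2 p for d = 2 and a few values of p.
comparison = [(p, edge_lower_bound(p, 2), p * math.log2(p) / 16)
              for p in (2 ** 10, 2 ** 14, 2 ** 20)]
```

For these values the unrolled recurrence stays above $(1/(8d))\,p\log_2 p$, in line with the stated solution; the gap at small $p$ depends on the assumed base case.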

10.8  Lower  Bound  on  Space  for  General  Graphs* 

Now that we have established that every graph in $\mathcal{G}(n,d)$ can be pebbled with $O(n/\log n)$
pebbles, we show that for all $n$ there exists a graph $G(n)$ in $\mathcal{G}(n,d)$ whose minimum space
requirement is at least $c_5 n/\log n$ for some constant $c_5 > 0$.

The graph $G(n)$ is obtained from a recursively constructed graph $H(k)$ on $2^k$ inputs and
$2^k$ outputs, $n/2 \le 2^k \le n$, by adding $n - 2^k$ vertices and no edges. The graph $H(k)$ is
composed of two copies of $H(k-1)$ and two copies of an $n$-superconcentrator, which is
defined below.

DEFINITION 10.8.1 An $n$-superconcentrator is a directed acyclic graph $G = (V,E)$ with $n$
input vertices and $n$ output vertices and the property that for any $r$ inputs and any $r$ outputs,
$1 \le r \le n$, there are $r$ vertex-disjoint paths in $G$ connecting these inputs and outputs. (Paths are
vertex-disjoint if they have no vertices in common.)

For $n = 2^k$ Valiant [343] has shown the existence of $n$-superconcentrators $SC(k)$ that
have $2^k$ inputs, $2^k$ outputs, and $c2^k$ edges. Since his graphs have in-degree greater than 2,
replace vertices with in-degree $d > 2$ with binary trees of $d$ leaves, thereby at most doubling
the size of the graph. (See Problem 10.12.) This provides the following result.

LEMMA 10.8.1 For some constant $c > 0$ and each integer $k$ and $n = 2^k$ there exists an
$n$-superconcentrator $SC(k)$ with $c2^k$ vertices.

We let $H(8) = SC(8)$. For $k \ge 8$ we construct $H(k+1)$ recursively from two copies
of $H(k)$, two copies of $SC(k)$, and extra edges, as suggested in Fig. 10.10. Here edges are
directed from left to right. The $2^k$ output vertices of the first (leftmost) copy of $SC(k)$ (called
$SC_1(k)$) are identified with the $2^k$ input vertices of the first copy of $H(k)$ (called $H_1(k)$),
the $2^k$ output vertices of $H_1(k)$ are identified with the $2^k$ input vertices of the second copy
of $H(k)$ (called $H_2(k)$), and the $2^k$ output vertices of $H_2(k)$ are identified with the $2^k$ input
vertices of the second copy of $SC(k)$ (called $SC_2(k)$). In addition, we introduce $2^{k+1}$ new
input vertices and $2^{k+1}$ new output vertices. The first (topmost) half of the new inputs (called
$I_t$) are connected via individual edges to the inputs of $SC_1(k)$. The second (bottommost) half
of the new inputs (called $I_b$) are also connected via individual edges to the inputs of $SC_1(k)$.
The new inputs are connected individually to the new outputs. Finally, each output of $SC_2(k)$
is connected via individual edges to two new output vertices, one each in the top (called $O_t$)
and bottom half (called $O_b$) of the new outputs.


Figure 10.10 A graph $H(k+1)$ requiring large minimum space.

The graph $H(k)$ has $n(k) = |H(k)|$ vertices, where $n(k)$ satisfies the following:

$$n(8) = c2^8$$

$$n(k+1) = 2n(k) + (2c + 4)2^k$$

The solution to the recurrence is $n(k) = (k - 7)c2^k + (k - 8)2^{k+1}$, as can be shown directly.
The graph $H(k)$ is in $\mathcal{G}(n(k), 2)$.
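The closed form can be confirmed by iterating the recurrence. The sketch below takes the superconcentrator constant $c$ to be an integer purely so that the arithmetic is exact; the function names are illustrative.

```python
def vertex_count_recursive(k, c):
    """n(k) from n(8) = c * 2**8 and n(j + 1) = 2 n(j) + (2c + 4) * 2**j."""
    if k < 8:
        raise ValueError("the recurrence starts at k = 8")
    n = c * 2 ** 8
    for j in range(8, k):
        n = 2 * n + (2 * c + 4) * 2 ** j
    return n

def vertex_count_closed_form(k, c):
    """The stated solution (k - 7) c 2**k + (k - 8) 2**(k + 1)."""
    return (k - 7) * c * 2 ** k + (k - 8) * 2 ** (k + 1)
```

Both functions agree for every $k \ge 8$ that was tried, and the same iteration shows that $n(k+1) \le 4n(k)$ once $k > 8$.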

Important  subgraphs  of  H(k  +  1)  have  the  superconcentrator  property,  as  we  now  show. 
This  result  is  applied  in  the  subsequent  lemma  to  derive  bounds  on  the  amount  of  space  used 
to  pebble  outputs  of  H(k  +  1). 

LEMMA 10.8.2 The subgraphs of $H(k+1)$ on $2^k$ inputs and $2^k$ outputs defined by vertices and
edges on paths from either inputs in $I_t$ or inputs in $I_b$ to the outputs of $SC_1(k)$ and $H_1(k)$ have the
$2^k$-superconcentrator property.

Proof The superconcentrator property applies to the outputs of $SC_1(k)$ by definition. Note
that the $j$th input of $H_1(k)$ is connected to its $j$th output by an individual edge for
$1 \le j \le 2^k$. Thus, any $r$ outputs of $H_1(k)$ have vertex-disjoint paths to the corresponding inputs of
$H_1(k)$. By the superconcentrator property of $SC_1(k)$, there are vertex-disjoint paths from
these outputs of $SC_1(k)$ to any $r$ of its inputs. These statements obviously apply to inputs
in $I_t$ and $I_b$. ■

Our goal is to show that pebbling the graph $H(k)$ requires a number of pebbles
proportional to $n(k)/\log n(k)$. To do this we establish the following stronger condition, which
implies the desired result.

LEMMA 10.8.3 Let $c_1 = 14/256$, $c_2 = 3/256$, $c_3 = 34/256$, and $c_4 = 1/256$. To pebble at
least $c_1 2^k$ outputs of $H(k)$ in any order from an initial placement of at most $c_2 2^k$ pebbles requires
there be a time interval $[t_1, t_2]$ during which at least $c_3 2^k$ inputs are pebbled and at least $c_4 2^k$
pebbles remain on the graph.

Proof The proof is by induction on $k$ with $k = 8$ as the base case. For the base case,
consider pebbling $c_1 2^k = 14$ outputs during a time interval $[0, t]$ from an initial placement
of no more than $c_2 2^k = 3$ pebbles.

By Problem 10.27 any four outputs of $SC(8)$ are connected via pebble-free paths to
$256 - 3 = 253$ inputs. At least one of these four outputs, say $v$, has pebble-free paths to
$64 = \lceil 253/4 \rceil$ inputs. Let $t_1 - 1$ be the last time at which all 64 of these inputs have pebble-free
paths to $v$. Let $t_2$ be the last time at which a pebble is placed on these 64 inputs. During the
time interval $[t_1, t_2]$ at least $64 \ge c_3 2^k$ inputs are pebbled and at least one pebble remains
on the graph; that is, at least $c_4 2^k$ pebbles remain. This establishes the base case.

Now assume the conditions of the lemma (our inductive hypothesis) hold for $k$. We
show they hold for $k + 1$. Assume that at least $c_1 2^{k+1}$ outputs of $H(k+1)$ are pebbled in
any order from an initial placement of at most $c_2 2^{k+1}$ pebbles during a time interval $[t_a, t_b]$.

We consider four cases including the following two cases. There is an interval
$[t_1, t_2] \subseteq [t_a, t_b]$ during which at least $c_2 2^k$ pebbles are always on the graph and at least $c_3 2^k$ outputs
of either (1) $SC_1(k)$, or (2) $H_1(k)$ are pebbled. By Lemma 10.8.2 the subgraph of $H(k+1)$
consisting of paths from $I_t$ (and $I_b$) to the outputs of each of these graphs constitutes a
$2^k$-superconcentrator. This is the only fact about these two cases that we use. Without loss of
generality, we show the hypothesis holds for the first of them.

The graph consisting of paths from inputs in $I_t$ to the outputs of $SC_1(k)$ constitutes a
$2^k$-superconcentrator. Prior to time $t_a$ there are at most $c_2 2^{k+1}$ pebbles on the graph and
during the interval $[t_1, t_2]$ there are at least $c_2 2^k$ (but at most $c_2 2^{k+1}$) pebbles on the graph.
Thus, there is a latest time $t_0$ before $t_1$ when there are at most $c_2 2^{k+1}$ pebbles on the graph.
Since $c_3 2^k \ge c_2 2^{k+1} + 1$ outputs of $SC_1(k)$ are pebbled in the interval $[t_1, t_2]$ (and in
the interval $[t_0, t_2]$), by Problem 10.27 at time $t_0$ there are at least $2^k - c_2 2^{k+1} \ge c_3 2^k$
inputs in $I_t$ (and in $I_b$) that are connected by pebble-free paths to the pebbled outputs of
$SC_1(k)$. Thus, at least $c_3 2^{k+1}$ inputs in $I_t$ and $I_b$ are connected via pebble-free paths to the
pebbled outputs of $SC_1(k)$. In $[t_0, t_1 - 1]$ there are at least $c_2 2^{k+1}$ pebbles continuously
on the graph, whereas there are at least $c_2 2^k$ pebbles during $[t_1, t_2]$. Since $c_2 2^k \ge c_4 2^{k+1}$,
the number continuously on the graph in $[t_1, t_2]$ is at least $c_4 2^{k+1}$ and we have the desired
conclusion for $H(k+1)$.

In the third case, there is an interval $[t_1, t_2] \subseteq [t_a, t_b]$ during which at least $c_1 2^k$ outputs
of the full graph $H(k+1)$ are pebbled and at least $c_2 2^k$ pebbles are always on the graph. This
implies that during $[t_1, t_2]$ either $c_1 2^k/2$ outputs in $O_t$ or in $O_b$ are pebbled, which in turn
implies that at least $c_1 2^k/2$ outputs of $SC_2(k)$ are pebbled. Since $c_1 2^k/2 \ge c_2 2^{k+1} + 1$
(at most $c_2 2^{k+1}$ pebbles are on $H(k+1)$), it follows from Problem 10.27 that at least
$2^k - c_2 2^{k+1} \ge c_3 2^k$ inputs in $I_t$ (or $I_b$) are connected via pebble-free paths to the pebbled
outputs of $SC_2(k)$. The total number of such inputs is $c_3 2^{k+1}$. Since $c_2 2^k \ge c_4 2^{k+1}$, there
are at least $c_4 2^{k+1}$ pebbles on the graph continuously during $[t_1, t_2]$ and we have the desired
conclusion.

In the fourth case none of the previous cases hold. Since $c_1 2^{k+1}$ outputs of $H(k+1)$
are pebbled during $[t_a, t_b]$, there is an earliest time $t_1 \in [t_a, t_b]$ such that $c_1 2^k$ outputs of
$H(k+1)$ are pebbled in the interval $[t_a, t_1 - 1]$. Since the third case does not hold, there
is a time $t_2 \le t_1$ such that fewer than $c_2 2^k$ pebbles are on the graph at $t_2 - 1$ and at least
$c_1 2^k$ outputs of $H(k+1)$ are pebbled in the interval $[t_2, t_b]$. It follows that at least $c_1 2^k/2$
outputs of $SC_2(k)$ are pebbled during this interval. Since $c_1 2^k/2 \ge c_2 2^k + 1$, it follows
from Problem 10.27 that at least $2^k - c_2 2^k \ge c_3 2^k$ inputs to $SC_2(k)$ (which are outputs of
$H_2(k)$) are connected via pebble-free paths to the pebbled outputs of $SC_2(k)$ and must be
pebbled during $[t_2, t_b]$. Since $c_3 2^k \ge c_1 2^k$, by the inductive hypothesis there is an interval
$[t_d, t_e] \subseteq [t_2, t_b]$ during which at least $c_3 2^k$ inputs of $H_2(k)$ (which are outputs of $H_1(k)$)
are pebbled and $c_4 2^k$ pebbles reside continuously on $H_2(k)$.

Since the second case does not hold, by an argument paralleling that given in the
preceding paragraph there must be a time $t_3 \in [t_d, t_e]$ such that at most $c_3 2^k/2$ outputs of
$H_1(k)$ are pebbled during $[t_d, t_3 - 1]$ and fewer than $c_2 2^k$ pebbles reside on $H(k+1)$ at
$t_3 - 1$. Thus, during $[t_3, t_e]$ at least $c_3 2^k/2 \ge c_1 2^k$ outputs of $H_1(k)$ are pebbled from
an initial configuration of fewer than $c_2 2^k$ pebbles. By the inductive hypothesis there is an
interval $[t_f, t_g] \subseteq [t_3, t_e]$ during which at least $c_3 2^k$ inputs of $H_1(k)$ (which are outputs of
$SC_1(k)$) are pebbled and $c_4 2^k$ pebbles reside on $H_1(k)$ continuously.

Since the first case does not hold, again paralleling an earlier argument there must be a
time $t_4 \in [t_f, t_g]$ such that at most $c_3 2^k/2$ outputs of $SC_1(k)$ are pebbled during $[t_f, t_4 - 1]$
and fewer than $c_2 2^k$ pebbles reside on $H(k+1)$ at $t_4 - 1$. Thus, during $[t_4, t_g]$ at least
$c_3 2^k/2 \ge c_2 2^k + 1$ outputs of $SC_1(k)$ are pebbled from an initial configuration of fewer
than $c_2 2^k$ pebbles. By Problem 10.27 at least $2^k - c_2 2^k \ge c_3 2^k$ inputs of $SC_1(k)$ are
connected via pebble-free paths to the pebbled outputs. Thus at least $c_3 2^k$ corresponding
inputs in both $I_t$ and $I_b$ must be pebbled for a total of at least $c_3 2^{k+1}$ inputs.

Since at least $c_4 2^k$ pebbles reside continuously on both $H_2(k)$ during $[t_d, t_e]$ and on
$H_1(k)$ during $[t_f, t_g]$ and $[t_f, t_g] \subseteq [t_d, t_e]$, it follows that $c_4 2^k + c_4 2^k = c_4 2^{k+1}$ pebbles reside
continuously on $H(k+1)$ during $[t_f, t_g]$. ■
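The proof relies on a handful of numeric inequalities among the constants $c_1, \ldots, c_4$. The sketch below collects the ones that appear explicitly in the argument and evaluates them with exact rational arithmetic; the dictionary keys are informal labels, not notation from the text.

```python
from fractions import Fraction
from math import ceil

C1, C2, C3, C4 = (Fraction(v, 256) for v in (14, 3, 34, 1))

def lemma_inequalities(k):
    """Evaluate the constant inequalities used in the proof for a given k >= 8."""
    m = 2 ** k
    return {
        "c3 2^k >= c2 2^(k+1) + 1": C3 * m >= 2 * C2 * m + 1,
        "2^k - c2 2^(k+1) >= c3 2^k": m - 2 * C2 * m >= C3 * m,
        "c2 2^k >= c4 2^(k+1)": C2 * m >= 2 * C4 * m,
        "c1 2^k / 2 >= c2 2^(k+1) + 1": C1 * m / 2 >= 2 * C2 * m + 1,
        "c1 2^k / 2 >= c2 2^k + 1": C1 * m / 2 >= C2 * m + 1,
        "2^k - c2 2^k >= c3 2^k": m - C2 * m >= C3 * m,
        "c3 2^k >= c1 2^k": C3 * m >= C1 * m,
        "c3 2^k / 2 >= c1 2^k": C3 * m / 2 >= C1 * m,
        "c3 2^k / 2 >= c2 2^k + 1": C3 * m / 2 >= C2 * m + 1,
    }

# Base case: one of four outputs reaches at least ceil(253 / 4) inputs.
base_case_inputs = ceil(Fraction(256 - 3, 4))
```

Every entry evaluates to true for $k \ge 8$, and the tightest one, $c_1 2^k/2 \ge c_2 2^{k+1} + 1$, holds with equality at $k = 8$.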

We are now ready to show the existence of a graph on $n$ vertices that requires $\Omega(n/\log n)$
minimal space.

THEOREM 10.8.1 For integers $n > 1$ there exists a graph $G(n)$ in $\mathcal{G}(n,d)$ that requires
minimum space $S_{\min}(G(n)) \ge c_5 n/\log n$ for some constant $c_5 > 0$.

Proof For $n \ge 2^8$, let $k$ be the largest integer such that $n(k) \le n$; that is, $n(k) \le n <
n(k+1)$. Construct the graph $G(n)$ by adding $n - n(k)$ vertices and no edges to the graph
$H(k)$. An optimal pebbling strategy for $G(n)$ pebbles the added vertices one at a time using
one pebble, after which $H(k)$ is pebbled. From Lemma 10.8.3 it follows that pebbling
$H(k)$ requires at least $c_4 2^k$ pebbles, since at least this many must reside on the graph at one
time. Since $n(k+1) \le 4n(k)$ for $k > 8$ and $c > 2$, it follows that $n/4 \le n(k) \le n$. This
implies that $2^k \le n$ and $k \le \log_2 n$ and that $n/4 \le k(c+2)2^k \le (\log_2 n)(c+2)2^k$.
From this we have $2^k \ge c_5 n/\log_2 n$, where $c_5 = 1/(4c + 8)$. The conclusion follows by
observing that at least $(c_4 c_5)\,n/\log_2 n$ pebbles are needed to pebble $G(n)$. ■

10.9  Branching  Programs 

The  general  branching  program  is  a  serial  computational  model  that  permits  data-dependent 
computation,  unlike  the  pebble  game.  A  branching  program  is  a  directed  graph  consisting  of 
a  single  starting  vertex  and  in  which  vertices  are  labeled  with  predicates.  Each  vertex  has  one 
outgoing  edge  for  each  value  of  its  predicate.  (See,  for  example,  Figs.  10.11  and  10.12.)  Time 
in  this  model  is  the  number  of  queries  performed,  and  computations  other  than  queries  are 
not  counted.  The  space  used  by  a  branching  program  is  the  base-2  logarithm  of  the  number 
of  vertices  in  its  graph.  Lower  bounds  on  space  and  input  time  obtained  with  the  branching 
program  apply  to  within  constant  multiplicative  factors  to  the  pebble  game  and  the  RAM 
model.  (See  Section  10.9.1.) 

As  noted  in  Section  10.1.1,  since  the  branching  program  reads  inputs  in  a  less  constrained 
manner than the straight-line program, it may be possible to solve some problems with branching
programs using less space or time than in the pebble game. As a consequence, space-time
lower  bounds  for  branching  programs  may  be  smaller  than  for  the  pebble  game.  Thus,  if  a 
problem is going to be solved with straight-line programs, such as an algebraic circuit, it is better
to use lower bounds derived with the pebble game unless the branching program gives the
same  lower  bounds.  In  particular,  branching  programs  give  smaller  space-time  lower  bounds 
for  integer  multiplication  and  shifting  (see  Section  10.13.2)  than  does  the  pebble  game. 

We  examine  two  kinds  of  branching  programs  in  this  section,  general  branching  programs 
and  decision  branching  programs. 

DEFINITION 10.9.1 A multigraph is a graph that may have more than one edge between two
vertices. A directed multigraph is a multigraph in which each edge has a direction. A directed
acyclic multigraph (DAM) is a directed multigraph with no directed cycles. A rooted directed acyclic
multigraph is a DAM with a root vertex, a vertex with no edges directed into it, such
that every vertex can be reached via some path from the root. A sink vertex has no edges directed
away from it.

A branching program $\mathcal{P}$ with input variables $x$ over the set $\mathcal{A}$ and output variables $y$ over
the set $\mathcal{F}$ is a rooted directed acyclic multigraph that has a query $q(x)$ associated with each vertex
except for sink vertices and has a query outcome associated with every edge directed away from a
vertex. Each edge may also carry as a label the values of some output variables, with the proviso that
each output variable is assigned exactly one value along any one path from the root to a sink vertex.

The decision branching program is a special kind of branching program in which the
queries $q(x)$ compare two variables and produce either the two outcomes $\{<, >\}$ or the three
outcomes $\{<, =, >\}$. Figure 10.11 shows an example of a decision branching program that
merges two 2-element sorted lists $(u_1, u_2)$ and $(v_1, v_2)$ ($u_1 < u_2$ and $v_1 < v_2$) by using
queries that compare the values of two input variables. Each vertex in the example has two
out-directed edges corresponding to the results of the query. The outputs appear in sorted
order along a path from the root to a leaf.

A  decision  tree  is  a  decision  branching  program  whose  DAM  (directed  acyclic  multigraph) 
is a tree. A decision tree may be constructed for a sequential comparison-based sorting algorithm,
such as Batcher’s odd-even merging algorithm of Section 6.8, by associating the first
comparison  with  the  root,  the  second  comparisons  with  the  roots  of  the  left  and  right  subtrees, 
etc. 

DEFINITION 10.9.2 A computation on a branching program $\mathcal{P}$ is a traversal of the unique
path in the DAM from the root to a leaf determined by the values of the input variables in
$x = (x_1, x_2, \ldots, x_n)$ over the set $\mathcal{A}$. The output of the computation is the sequence of output values
in $y = (y_1, y_2, \ldots, y_m)$ over the set $\mathcal{F}$ encountered on the edges of the path traversed.

A function $f^{(n)} : \mathcal{A}^n \mapsto \mathcal{F}^m$ with input variables in $x$ and output variables in $y$, namely

$f^{(n)}(x_1, x_2, \ldots, x_n) = (y_1, y_2, \ldots, y_m)$

is computed by $\mathcal{P}$ if for each value of $x$ the correct value of each output variable appears exactly
once on each path from the root to a leaf.

Figure 10.11 A decision branching program that merges the lists $(u_1, u_2)$ and $(v_1, v_2)$ when
$u_1 < u_2$ and $v_1 < v_2$.

The time associated with a computation is the length of the path traversed by the computation.
The computation time $T$ of a branching program is the length of its longest path.

In Fig. 10.11 the computation associated with the input values $(u_1, u_2, v_1, v_2) = (2, 4, 1, 3)$
takes the right branch out of the root and produces the output value $v_1 = 1$, takes the left
branch at the next vertex and produces $u_1 = 2$, and takes the right branch at the last vertex
and produces $v_2 = 3$ and $u_2 = 4$. The output of this computation is the sorted sequence
1, 2, 3, 4, as expected. This branching program merges the two sorted lists. Each sink vertex
corresponds to one of the four ways of merging the two lists. The computation time of this
branching program is 3.
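
As an illustration, the traversal described above can be replayed in code. The following Python sketch encodes a four-query pyramid for merging two 2-element sorted lists. The vertex names, the dictionary layout, and the convention that ties take the right branch are assumptions made for this example; it is not a transcription of Fig. 10.11.

```python
# Hypothetical encoding of a decision branching program that merges (u1, u2)
# and (v1, v2). Each vertex maps to (left operand, right operand, edges);
# each edge maps an outcome to (outputs produced on the edge, next vertex).
# A next vertex of None stands for a sink.
MERGE_PROGRAM = {
    "00": ("u1", "v1", {"<": (["u1"], "10"), ">": (["v1"], "01")}),
    "10": ("u2", "v1", {"<": (["u2", "v1", "v2"], None), ">": (["v1"], "11")}),
    "01": ("u1", "v2", {"<": (["u1"], "11"), ">": (["v2", "u1", "u2"], None)}),
    "11": ("u2", "v2", {"<": (["u2", "v2"], None), ">": (["v2", "u2"], None)}),
}


def run(program, root, inputs):
    """Follow the unique root-to-sink path selected by the input values.

    Returns the outputs collected along the path and the number of queries.
    """
    outputs, queries, vertex = [], 0, root
    while vertex is not None:
        left, right, edges = program[vertex]
        outcome = "<" if inputs[left] < inputs[right] else ">"  # ties go right
        queries += 1
        names, vertex = edges[outcome]
        outputs.extend(inputs[name] for name in names)
    return outputs, queries


merged, time_used = run(MERGE_PROGRAM, "00", {"u1": 2, "u2": 4, "v1": 1, "v2": 3})
print(merged, time_used)  # [1, 2, 3, 4] 3
```

The number of queries on the path is the time of that computation; the maximum over all inputs is the computation time of the program.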

Branching programs that compare elements at vertices are well suited to merging and sorting
but are not of the most general type.

DEFINITION 10.9.3 A general branching program $\mathcal{P}$ with input variables $x$ over a finite set
$\mathcal{A}$ has a query of the form $x_i = ?$ associated with a variable $x_i$ at each vertex. It also has one edge
directed away from the vertex for each value of $x_i$. A general branching program is non-redundant
if along each path from the root to a leaf a query $x_i = ?$ appears at most once.

The general branching program is also known as a binary decision diagram (BDD). BDDs
are widely used in the computer-aided design (CAD) of circuits for Boolean functions.

A general branching program that convolves two short binary sequences over the integers
is shown in Fig. 10.12. (Convolution is defined in Section 6.7.4.) A computation takes the
left branch out of a vertex when the associated variable has value 0 and the right branch when it
has value 1. This branching program computes the convolution $c = a \otimes b$ of the sequences
$a = (a_0, a_1)$ and $b = (b_0, b_1)$; that is,

$c_0 = a_0 b_0, \quad c_1 = a_0 b_1 + a_1 b_0, \quad c_2 = a_1 b_1$
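
The three output expressions are the length-2 case of the general convolution sum $c_k = \sum_{i+j=k} a_i b_j$. A minimal Python sketch (the function name `convolve` is chosen here for illustration) checks the closed forms against that sum on all 16 binary inputs.

```python
from itertools import product


def convolve(a, b):
    """Convolution over the integers: c[k] is the sum of a[i] * b[j] with i + j = k."""
    c = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            c[i + j] += ai * bj
    return c


# Compare with the closed forms for sequences of length two.
for a0, a1, b0, b1 in product((0, 1), repeat=4):
    assert convolve([a0, a1], [b0, b1]) == [a0 * b0, a0 * b1 + a1 * b0, a1 * b1]

print(convolve([1, 1], [1, 1]))  # [1, 2, 1]
```

Because $c_1$ can equal 2, the outputs are integers rather than bits even though the inputs are binary.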

The  performance  of  a  branching  program  is  also  measured  by  its  space  complexity. 

DEFINITION 10.9.4 The space used by branching program $\mathcal{P}$ is the base-2 logarithm of the number
of vertices in its directed acyclic multigraph.

As  shown  in  the  next  section,  this  definition  permits  a  lower  bound  on  the  space  complexity 
used by any reasonable general-purpose computer model equipped with a random-access read-only
memory for its input data.

The following lemma demonstrates that every decision branching program can be simulated
by a general branching program, thereby showing the latter to be more general than the
former.  (See  Problem  10.35.) 

LEMMA 10.9.1 Every decision branching program with variables over a finite set $\mathcal{A}$ with computation
time $T$ and space $S$ can be simulated by a general branching program with computation
time $2T$ and space $S + \log(|\mathcal{A}| + 1)$.

This  result  is  proved  by  constructing  a  general  branching  program  to  simulate  a  comparison 
operator  and  substituting  it  for  the  comparison  operator  in  a  decision  branching  program.  (See 
Problem  10.35.)  The  graph  that  results  from  this  construction  is  explicitly  a  multigraph. 

While  Lemma  10.9.1  establishes  that  decision  branching  programs  are  no  more  powerful 
than  general  branching  programs,  this  does  not  imply  that  general  branching  programs  require 
less  space.  In  fact,  the  space  complexity  of  a  given  decision  branching  program  is  independent 
of  the  size  of  the  set  A  over  which  the  variables  are  defined;  this  is  not  true  for  general  branching 
programs. 

If space complexity is not an issue, a tree program can be constructed. This is a branching
program whose DAM is a tree. The following recursive procedure converts a branching
program to a tree program: (a) If any immediate descendant of the root has more than one edge
directed into it, make as many copies of the submultigraph rooted at that descendant as there
are entering edges and direct exactly one edge into each. (b) Apply this procedure recursively to
each of the submultigraphs until leaf vertices are reached. This procedure does not change the
length of any path in the original DAM or the computation time.
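
The copying procedure can be sketched in Python as follows. The adjacency-list representation (a dictionary mapping each vertex to a list of `(label, child)` pairs) and the function names are assumptions made for this illustration. Copying a vertex once for every path that reaches it has the same effect as applying the two steps recursively.

```python
from itertools import count


def unfold(dam, root):
    """Expand a rooted DAM into a tree by duplicating shared submultigraphs."""
    fresh = count()
    tree = {}

    def copy(vertex):
        node = next(fresh)  # every visit creates a new copy of the vertex
        tree[node] = [(label, copy(child)) for label, child in dam[vertex]]
        return node

    return tree, copy(root)


def longest_path(graph, vertex):
    """Number of edges on the longest path from vertex to a sink."""
    return max((1 + longest_path(graph, c) for _, c in graph[vertex]), default=0)


# A small DAM in which the vertices a and b share the sink s.
dam = {
    "r": [(0, "a"), (1, "b")],
    "a": [(0, "s"), (1, "s")],
    "b": [(0, "s"), (1, "s")],
    "s": [],
}
tree, tree_root = unfold(dam, "r")
print(len(dam), len(tree))                                     # 4 7
print(longest_path(dam, "r"), longest_path(tree, tree_root))   # 2 2
```

The example shows the trade being made: path lengths are unchanged, but the vertex count (and hence the space) can grow.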

The notions of space and time can be generalized to average time and space when a probability
distribution is defined on input values. (See Problem 10.37.)

Below  we  present  a  key  lemma  used  to  derive  lower  bounds  on  the  space-time  product. 
This  lemma  is  stated  for  normal-form  branching  programs,  general  branching  programs 
whose  DAMs  are  level  multigraphs,  that  is,  multigraphs  in  which  each  vertex  has  a  level  and 
adjacent  vertices  are  in  adjacent  levels.  An  example  of  such  a  graph  is  shown  in  Fig.  10.13. 

LEMMA 10.9.2 If there is a general branching program of space $S$ and computation time $T$ for a
function $f$, then there is a normal-form branching program for $f$ that has space $2S$ and computation
time $T$.

Proof To convert a general branching program to a normal-form branching program, create
$T + 1$ copies of the general branching program, one for each time step including the zeroth.
Delete the original edges and add an edge from vertex $u$ in the $i$th copy to vertex $v$ in the
$(i+1)$st copy if there was an edge between $u$ and $v$ in the original graph. Now delete all
edges and vertices that are not reached from the root of the zeroth branching program. (See
Fig. 10.14.)

This procedure increases the number of vertices by at most a factor of $T$, thereby increasing
the space by adding at most $\log T$. However, a branching program with space $S$
has $2^S$ vertices. Thus, the length of the longest path through the program $T$ cannot exceed
$2^S$, or $S + \log T \le 2S$. ■

Figure 10.13 A normal-form tree program for table lookup. It has one path for each value of
the input. (Leaf labels in the figure: 001 110 010 000 011 101 111 100.)
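
The construction in the proof of Lemma 10.9.2 can be sketched in Python. Rather than creating all $T + 1$ copies and then deleting the unreachable ones, this version creates only the copies reachable from the root at level 0, which yields the same level multigraph. The dictionary representation and the function name are assumptions made for the illustration.

```python
def to_normal_form(dam, root):
    """Level a rooted DAM: vertex (v, i) is a copy of v reached after i steps."""
    levelled = {}
    pending = [(root, 0)]
    while pending:
        vertex, level = pending.pop()
        if (vertex, level) in levelled:
            continue  # this copy was already created
        edges = [(label, (child, level + 1)) for label, child in dam[vertex]]
        levelled[(vertex, level)] = edges
        pending.extend(target for _, target in edges)
    return levelled


# The edge r -> s skips a level in the original program.
dam = {"r": [(0, "a"), (1, "s")], "a": [(0, "s"), (1, "s")], "s": []}
normal = to_normal_form(dam, "r")
print(sorted(normal))  # [('a', 1), ('r', 0), ('s', 1), ('s', 2)]
```

Here the sink `s` is duplicated because it is reachable after one step and after two steps; every edge of the result joins adjacent levels.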

Generally the space S used for a branching program computation will be large by comparison
with log T, in which case the space bounds for normal-form branching programs and
general  branching  programs  will  differ  by  at  most  a  constant  factor. 

In  the  rest  of  this  chapter  when  we  speak  of  a  branching  program  we  mean  a  general 
branching  program. 


Figure  10.14  Construction  of  a  normal-form  general  branching  program  as  a  level  multigraph. 
We close this section by describing a normal-form tree program for table lookup, an
important programming tool that can be used to compute an arbitrary function
$f^{(n)} : \mathcal{A}^n \mapsto \mathcal{A}^m$ on $n$ variables whose value is an $m$-tuple. Each of the $n$ variables is read and the value of
the function is found in a table. This is simulated by a tree program with branching factor $|\mathcal{A}|$
in which the variables are read in succession until they are all read, at which point the value of
the function is provided. An example of such a tree program for a function $f^{(3)} : \mathcal{B}^3 \mapsto \mathcal{B}^3$
is shown in Fig. 10.13. There is one path through the tree for each of the $|\mathcal{A}|^n$ possible
assignments to the $n$ inputs. The sink vertices are labeled by the appropriate $m$-tuple. Such
table-lookup tree programs have computation time $n$ and space proportional to $n \log |\mathcal{A}|$ since
they have $(|\mathcal{A}|^{n+1} - 1)/(|\mathcal{A}| - 1)$ vertices with $|\mathcal{A}|$ edges per vertex except for those at the lowest
level.
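
The vertex count is a geometric sum over the levels of the tree, and the space bound follows by bracketing that sum between two powers of $|\mathcal{A}|$. A short derivation, assuming $|\mathcal{A}| \ge 2$ and base-2 logarithms:

```latex
\[
  \sum_{i=0}^{n} |\mathcal{A}|^{i} = \frac{|\mathcal{A}|^{n+1} - 1}{|\mathcal{A}| - 1},
  \qquad
  |\mathcal{A}|^{n} \le \frac{|\mathcal{A}|^{n+1} - 1}{|\mathcal{A}| - 1} < |\mathcal{A}|^{n+1},
\]
\[
  n \log_2 |\mathcal{A}| \le S < (n+1) \log_2 |\mathcal{A}| .
\]
```

For a function on three Boolean inputs, $|\mathcal{A}| = 2$ and $n = 3$, so the tree has $2^4 - 1 = 15$ vertices.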


10.9.1  Branching  Programs  and  Other  Models 

We begin this section with a comparison of branching programs and pebble games and conclude
with a brief comparison of branching programs and the RAM model of computation.

The pebble model assumes that computation is serial and straight-line. If all algorithms
used for a particular problem are of this type, the pebble game is the appropriate model,
especially if the lower bounds on space-time exchanges are larger than those provided by the
branching program model. (All algorithms used today for integer multiplication are straight-line
and the lower bounds on the space-time product for this problem are larger with the
pebble game than with the branching program model.) If the two models give the same lower
bounds, then we can invoke Lemma 10.9.3 to derive lower bounds on the space-time exchanges
for pebbling from those for branching programs when $\log_2 T_p$ is small by comparison
with $S_p$, where $T_p$ and $S_p$ are the time and space used by the pebbling model.

Data-dependent reading of inputs may allow the branching program to perform a computation
more quickly than the pebbling model. For example, merging requires a space-time
product  that  is  quadratic  in  the  length  of  the  input  strings  with  the  pebble  game  but  only 
linear  in  the  branching  program.  (See  Section  10.10.2.)  This  demonstrates  that  the  branching 
program  is  a  much  more  natural  model  for  this  problem. 

If  the  lower  bounds  derived  with  the  branching  program  are  comparable  in  strength  to 
those  offered  by  the  pebbling  model,  as  is  true  for  most  of  the  problems  considered  in  this 
chapter,  straight-line  programs  are  the  better  model  for  these  problems.  But  the  extra  flexibility 
offered  by  branching  programs  means  that  when  their  results  are  comparable  to  those  provided 
by  the  pebble  game,  one  must  work  harder  to  obtain  them.  (See  Sections  10.11  and  10.12.) 

The branching program measures the time to read inputs but ignores the time for computations
and the production of outputs. By contrast, the pebble game measures the time to
read inputs, perform computations, and produce outputs. Although the time for computations
generally cannot be ignored, the methods available today to derive lower bounds for both models
are based on the time spent reading inputs. But while for many problems the time to read
inputs dominates computation time for many values of space, when space is large the pebbling
model has the potential to give larger lower bounds than the branching program model. For
example, no way is known to compute the $n$-point DFT with fewer than $O(n \log n)$ steps,
the number used by the FFT algorithm, although in the limit of large space the branching
program gives a lower bound on space proportional to $n$.

To simulate the pebbling of a DAG by a branching program we must give an interpretation
to each vertex of the DAG: assign an operation to each non-input vertex and a variable as
well as values to each input vertex. Two different interpretations of a DAG may yield different
branching programs. Of course, a DAG is pebbled without regard to the interpretation of vertices:
the pebble-game lower bounds use only the fact that vertices can hold one of $|\mathcal{A}|$ values
and do not depend explicitly on the interpretation given to their operator.

LEMMA 10.9.3 Given a pebbling $\mathcal{P}$ of an interpreted directed acyclic graph $G$ that uses $S_p$
pebbles and $T_p$ input steps to compute a function with operations over a finite set $\mathcal{A}$, there is a
branching program with space $S_p \log |\mathcal{A}| + \log(2T_p)$ and time $T_p$ that computes the function
computed by $G$. Thus, if $2T_p < |\mathcal{A}|^{S_p}$, simultaneous lower bounds on the space and time for
a branching program for the function imply simultaneous lower bounds on space and time in the
pebble game that differ by at most constant multiplicative factors.

Proof We construct a branching program $\mathcal{Q}$ to simulate the pebbling $\mathcal{P}$ of a directed acyclic
graph that uses $S_p$ pebbles and $T_p$ steps. (Figure 10.15 illustrates the construction of such
a branching program.) Initially the branching program has a single vertex, the root, which
is labeled with the first variable to be pebbled according to $\mathcal{P}$. Advance the first pebble as
far as possible. Create a vertex in the branching program for each value of the operation
or input covered by the first pebble. Label these new vertices with the name of the second
input to be pebbled and attach an edge from the root vertex to these new vertices labeled
with the corresponding value for the first input. Advance pebbles as far as possible according
to $\mathcal{P}$ and create one new vertex in the branching program for each different tuple of values
residing under the pebble(s) currently on the DAG. (In the example of Fig. 10.15, after
placing a pebble on the second vertex we advance a pebble to the third vertex and remove
all other pebbles. Thus, only two vertices are added to the branching program at this step.)
Label the new vertices with the third input to be pebbled. Now repeat the above process
by advancing pebbles as far as possible (in the example, pebbles now reside on the third and
fourth vertices), add one new vertex for each tuple of values under the pebbles on the DAG (four
vertices are added), and connect edges from the previous to the current set of new vertices that
conform to the values assumed at the vertices of the DAG. This process is repeated until all inputs
have been pebbled.

Since the values of operations are always determined by the values under at most $S_p$
pebbles, the number of new vertices added in $\mathcal{Q}$ with the pebbling of each new input vertex
in $G$ is at most $|\mathcal{A}|^{S_p}$. Since $T_p$ input vertices of $G$ are pebbled, it follows that $\mathcal{Q}$ has at most
$T_p |\mathcal{A}|^{S_p} + 1 \le 2T_p |\mathcal{A}|^{S_p}$ vertices, from which the conclusion follows. ■

Figure 10.15 A general branching program (b) that simulates the pebbling of a DAG (a) in the
vertex order 1, 2, 4, 3, 5, 6, 7. The DAG input variables are denoted $u$, $v$, $w$, and $x$ and assume
values in $\{0, 1\}$. $+$ denotes OR and $*$ denotes AND.
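
The space bound stated in Lemma 10.9.3 follows by taking base-2 logarithms of the vertex count. The second line below shows why the lemma's condition on $2T_p$ costs at most a factor of two in space.

```latex
\[
  S \le \log_2\!\bigl(2 T_p |\mathcal{A}|^{S_p}\bigr)
    = S_p \log_2 |\mathcal{A}| + \log_2 (2 T_p),
\]
\[
  2 T_p \le |\mathcal{A}|^{S_p}
  \;\Longrightarrow\;
  S \le 2 S_p \log_2 |\mathcal{A}| .
\]
```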

A branching program can also simulate a computation by a general model of computation,
such as the RAM discussed in Section 3.4, as we now show. Let the RAM have $M$ $b$-bit words
of memory and a finite number of $b$-bit words in its CPU. Consider any program for such a
machine. Its state is determined by the values in its registers and memory locations. Thus the
RAM has at most $O(2^{Mb})$ states. Let the space used by a RAM be the base-2 logarithm of
the number of its states. Let the RAM execute $T_{\mathrm{RAM}}$ steps to read its inputs. We simulate
this computation in the same fashion as with the pebble game. After reading an input variable,
the branching program enters one of at most $O(2^{Mb})$ vertices corresponding to states of the
RAM. Since the RAM reads inputs on $T_{\mathrm{RAM}}$ steps, the branching program also takes $T_{\mathrm{RAM}}$
steps and has at most $O(T_{\mathrm{RAM}} 2^{Mb})$ vertices or uses space of at most $O(Mb + \log T_{\mathrm{RAM}})$.
As long as $Mb$ is larger than some multiple of $\log T_{\mathrm{RAM}}$, simultaneous lower bounds on the
time to read inputs and space of a branching program for a function computed by the RAM
serve as lower bounds on the same quantities on the RAM. The following lemma summarizes
this discussion.

LEMMA 10.9.4 Given a RAM program that uses space $S_{\mathrm{RAM}}$ and $T_{\mathrm{RAM}}$ input steps to compute
$f : \mathcal{A}^n \mapsto \mathcal{A}^m$ there is a branching program with space $O(S_{\mathrm{RAM}} + \log(2T_{\mathrm{RAM}}))$ and time
$T_{\mathrm{RAM}}$ that computes $f$. Thus, if $2T_{\mathrm{RAM}} < 2^{S_{\mathrm{RAM}}}$, simultaneous lower bounds on the space and
time for a branching program for the function imply simultaneous lower bounds on the space and
time on the RAM that differ by at most constant multiplicative factors.

10.10  Straight-Line  Versus  Branching  Programs 

In this section we show that some problems can use space and time more efficiently with
branching programs than they can with the pebble game. We demonstrate this for the cyclic
shifting function $f^{(n)}_{\mathrm{cyclic}} : \mathcal{B}^{n + \lceil \log n \rceil} \mapsto \mathcal{B}^n$ introduced in Section 2.5.2 and the merging
problem introduced in Section 6.8. However, for all of the other problems studied in this
chapter the lower bounds obtained with these two models are the same up to constant multiplicative
factors, except for integer multiplication, where the branching program bound is
smaller by a factor of log2 n.

It  is  important  to  note,  however,  that  the  superiority  of  branching  programs  arises  from 
the  assumption  that  inputs  can  be  read  in  a  data-dependent  fashion,  an  assumption  that  is 
not available to straight-line programs. As we know from Problem 10.20, if branching is
allowed  but  inputs  must  be  read  in  a  data-independent  fashion  by  an  input-output-oblivious 
finite-state  machine,  Theorem  10.4.1  applies.  Thus,  branching  programs  that  read  inputs  in 
a  data-independent  fashion  have  no  advantage  over  straight-line  programs,  at  least  in  terms  of 
lower  bounds  on  space-time  exchanges. 

10.10.1  Efficient  Branching  Programs  for  Cyclic  Shift 

We present a branching program for $f^{(n)}_{\mathrm{cyclic}}$ that uses space $S = O(\log n)$ and time
$T = n + \lceil \log n \rceil$; that is, $ST = O(n \log n)$, a product that is much less than the $\Omega(n^2)$ product
required in the pebble game. (See Section 10.5.2.)

The function $f^{(n)}_{\mathrm{cyclic}}$ has $n + \lceil \log n \rceil$ Boolean variables: $\lceil \log n \rceil$ control inputs and $n$
“value” inputs whose values are shifted by the amount specified by the control inputs. Our
efficient branching program is a tree program (see Fig. 10.13) that reads the control inputs
and selects one of $n$ paths through the tree. (Note that $n \le 2^{\lceil \log n \rceil} < 2n$.) Each path
corresponds to one of the $n$ possible cyclic shifts of the $n$ value inputs. Attached to a leaf of
this tree is a chain of vertices, one per value input. These inputs appear in the order specified
by the cyclic shift associated with the path. An input value is read and then produced as output
at each of these $n$ vertices. Since this branching program has at most $2n + 2n^2$ vertices, it has
space $O(\log n)$. It uses time $n + \lceil \log n \rceil$.
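
A computation of this tree program can be traced with a short Python sketch. The bit order of the control inputs (most significant first), the direction of the shift, and the reduction of the shift amount modulo $n$ are assumptions made for the example; Section 2.5.2 fixes the actual conventions. The vertex tally counts query vertices only and gives every leaf of the control tree its own chain.

```python
def ceil_log2(n):
    """Smallest k with 2**k >= n, for n >= 1."""
    return (n - 1).bit_length()


def run_cyclic_shift(control, values):
    """Read the control inputs, then read and emit the value inputs in shifted order."""
    n = len(values)
    queries = 0
    shift = 0
    for bit in control:            # descend the binary tree, one query per control bit
        shift = 2 * shift + bit
        queries += 1
    shift %= n                     # assumed treatment of shift amounts >= n
    outputs = []
    for j in range(n):             # chain attached to the selected leaf
        outputs.append(values[(j + shift) % n])
        queries += 1
    return outputs, queries


def query_vertices(n):
    """Tree vertices that read control bits plus one chain of n vertices per leaf."""
    leaves = 2 ** ceil_log2(n)
    return (leaves - 1) + leaves * n


outputs, queries = run_cyclic_shift([1, 0], ["a", "b", "c"])
print(outputs, queries)    # ['c', 'a', 'b'] 5
print(query_vertices(3))   # 15
```

For $n = 3$ the tally of 15 is below the bound $2n + 2n^2 = 24$, and the number of queries equals $n + \lceil \log n \rceil = 5$.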

If  cyclic  shifting  is  to  be  done  by  a  straight-line  program,  say  in  hardware,  then  it  is  better  to 
use  the  pebble  game  for  lower  bounds  since  this  model  applies  to  logic  circuits  and  the  results 
it  provides  are  stronger.  However,  if  the  problem  is  to  be  executed  in  software,  the  branching 
program  should  be  used  unless  the  program  is  straight-line. 

10.10.2  Efficient  Branching  Programs  for  Merging 

Consider now the merging problem. In Section 10.5.6 we show that it requires an $\Omega(n^2)$
space-time product in the pebble game, where $n$ is the size of the input. However, when executed by a branching
program it uses space $O(\log n)$ and time $O(n)$, as we show.

Figure 10.11 shows a “pyramid” decision branching program to merge two sequences of
length two. It is straightforward to extend this decision branching program to sequences of
length $n$, as suggested in Fig. 10.16. In this figure vertices are labeled by the number of
elements that are removed from the two lists being merged before arriving at the vertex carrying
the label. For example, prior to arriving at the vertex labeled (2, 1), two elements have been
removed from the left list and one from the right list. We assume that the lists to be merged
each contain $n$ elements. Thus, all the pyramid vertices below a vertex labeled with $(n, k)$ or
$(k, n)$, $1 \le k \le n - 1$, are deleted because below such vertices no further comparisons are
needed; the outputs produced are those on the list from which $k$ values have been removed.
Thus, we attach a chain of $n - k$ vertices, one for each of the input values at the end of the
smaller list. If the root is at level 1, vertices labeled $(n, k)$ and $(k, n)$ are at level
$n + k + 1 < 2n + 1$.

The number of vertices on level $l$ of this decision branching program is at most $l$. Since
$1 \le l \le 2n$, it has at most $\sum_{l=1}^{2n} l = n(2n+1) < (n+1)(2n+1)$ vertices. The space associated with
this program is $O(\log((n+1)(2n+1)))$. Since the length of the longest path in this program
is $2n$, it has time $2n$ associated with it. From Lemma 10.9.1 it follows that merging can be
realized by a general branching program with space $O(\log n) + \log |\mathcal{A}|$ and time $O(n)$ or a
space-time product that is $O(n \log n)$, much smaller than the $\Omega(n^2)$ space-time product that
applies to the pebble game.

Figure 10.16 The top portion of a decision branching program to merge two sorted lists. The
pair of integers at a vertex denotes the number of elements removed from the left and right lists
by the program before arriving at the vertex carrying the pair. The root is labeled (0, 0).
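
The pyramid can be traced without building the graph, because the vertex reached after removing $i$ elements from the left list and $j$ from the right is determined by the pair $(i, j)$. The Python sketch below follows that idea. The function names and the tie-breaking rule are assumptions for illustration, and the vertex tally excludes sinks.

```python
def pyramid_merge(left, right):
    """Merge two sorted lists of equal length n, counting one step per vertex visited."""
    n = len(left)
    assert len(right) == n
    i = j = steps = 0
    merged = []
    while i < n and j < n:              # pyramid vertex labeled (i, j)
        steps += 1
        if left[i] <= right[j]:         # ties take from the left list (assumption)
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    rest = left[i:] if j == n else right[j:]
    for value in rest:                  # chain of vertices that read the leftover inputs
        steps += 1
        merged.append(value)
    return merged, steps


def non_sink_vertices(n):
    """n * n comparison vertices plus chains of lengths n, n - 1, ..., 1 on each side."""
    return n * n + n * (n + 1)


print(pyramid_merge([1, 4, 6], [2, 3, 5]))  # ([1, 2, 3, 4, 5, 6], 6)
print(non_sink_vertices(3))                 # 21
```

Every computation visits exactly $2n$ vertices, and the state carried between steps is the pair $(i, j)$, which needs $O(\log n)$ bits.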

10.11  The  Borodin-Cook  Lower-Bound  Method 

In this section we generalize the method of Borodin and Cook [53] for deriving space-time
lower bounds for branching programs. The conditions under which lower bounds can be
derived are captured by a property of functions called $(\phi, \lambda, \mu, \nu, \tau)$-distinguishability, which
is stronger than the flow property used to derive lower bounds on space-time tradeoffs for
the pebble game. In fact, we show that a function that is $(1, \lambda, \mu, \nu, \tau)$-distinguishable is
$(\alpha, n, m, p)$-independent for the appropriate values of $\alpha$, $n$, $m$, and $p$.

DEFINITION 10.11.1 Let $\tau : \mathbb{N} \mapsto \mathbb{N}$ be a nondecreasing function. A function $f : \mathcal{A}^n \mapsto \mathcal{F}^m$
is $(\phi, \lambda, \mu, \nu, \tau)$-distinguishable for $0 < \phi, \lambda, \mu, \nu \le 1$ if there is a set $\mathcal{D} \subseteq \mathcal{A}^n$ satisfying
$|\mathcal{D}| \ge \phi |\mathcal{A}|^n$ such that for each assignment to a selection of $a \le \lambda n$ input variables and each
assignment to a selection of $b \le \mu m$ output variables of $f$, $a \le \tau(b)$, the number of input
$n$-tuples consistent with the values of the $a$ input variables that cause $f$ to assume the given values
for the $b$ output variables is at most $|\mathcal{A}|^{n - a - \nu b}$.

The meaning of this property for the function $f$ is suggested by Fig. 10.17. For a fraction
$\phi$ of the input tuples ($\phi = 1$ is the normal case), when any $a$ input variables and any $b$
output variables of $f$ are assigned values, the maximum number of input $n$-tuples that cause
$f$ to produce these output values is no more than $|\mathcal{A}|^{n - a - \nu b}$. This property is used below to
derive a lower bound on the space-time product for branching programs. We use $\phi = 1$ for all
problems considered below except for the unique elements problem.

The theorem below also uses a version of the pigeonhole principle. Time is subdivided into
intervals containing equal numbers of input queries. This has the effect of chopping the
branching program up into layers (called stages in the proof). We reason that each input
$n$-tuple follows a rich path through a layer that contains a large number of outputs. Because of
the distinguishability property, an upper limit on the number of inputs can be associated with
each rich path. It follows that there must be many rich paths or that the branching program
must have a large number of vertices (and space).

Figure 10.17 For a fraction of at least $\phi$ of the input $n$-tuples, a $(\phi, \lambda, \mu, \nu, \tau)$-distinguishable
function $f$ has an upper limit of $|\mathcal{A}|^{n - a - \nu b}$ on the number of input $n$-tuples consistent with
an assignment of values to any $a$ inputs and any $b$ outputs of $f$ when $a \le \lambda n$, $b \le \mu m$ and
$a \le \tau(b)$.

THEOREM 10.11.1 Let $f : \mathcal{A}^n \mapsto \mathcal{F}^m$ be $(\phi, \lambda, \mu, \nu, \tau)$-distinguishable for $\lambda \le \mu$. Then
the space $S$ and time $T \ge n$ required by any general branching program $\mathcal{P}$ that computes $f$ must
satisfy

$$S \ge \frac{m \nu a}{4T} \log_2 |\mathcal{A}| + \frac{1}{2} \log_2 \phi$$

where $a \le \lambda n$ is the largest integer satisfying $a \le \tau(ma/2T)$ and
$n \ge (\lceil 1/\lambda \rceil - 2)/(1 - \lambda(\lceil 1/\lambda \rceil - 1))$. (Note that $\log_2 \phi$ is a negative constant.)

Proof We show that $S \ge (m \nu a / 2T) \log_2 |\mathcal{A}| + \log_2 \phi$ for normal-form branching programs
and then invoke Lemma 10.9.2, which converts a general branching program of space $S$
and time $T$ into a normal-form branching program of space $2S$ and time $T$.

The approach is to break $\mathcal{P}$ into $\sigma = \lceil (T+1)/(a+1) \rceil$ disjoint stages starting with the
root at the zeroth level, each stage of which contains $a + 1$ levels, $a \le \lambda n$, except possibly
for the last, which may have fewer levels. ($\sigma \le 2T/a$ since $T \ge n \ge 1$.) Each stage has
depth $a$. Thus, the last row in one stage is the first row in the next stage. Each stage except
for the first typically has multiple roots. (Figure 10.18(a) shows a branching program with
$T = 5$ levels. Since $a = 2$, it is divided into $\sigma = \lceil (T+1)/(a+1) \rceil = 2$ layers by the
horizontal line. Internal vertices belong to two layers.)

Using a modified version of the technique described on page 491 to create a tree program
from a branching program, replace the branching program in each stage by a set of tree
programs of depth $a$, shown in Fig. 10.18(b). Eliminate redundant queries on each path in
each tree. Also, pad paths that do not have $a$ queries on them with superfluous but
non-redundant queries so that each path through each tree has the same length. A superfluous
query has all of its output edges directed to a single successor vertex. Also, move all tree
outputs down to the leaves of these trees (which are also roots of trees in the next stage). Let
$\mathcal{P}^*$ be the new branching program. Since the roots of trees in each stage are vertices in the
original branching program, there are no more than $2^S$ trees.

Figure 10.18 The transformation of a $T$-step branching program (a) into a branching program (b)
with $\sigma = \lceil (T+1)/(a+1) \rceil$ layers in which each layer consists of a forest of trees.

Let $x$ be one of the input $n$-tuples among the fraction $\phi$ for which
$(\phi, \lambda, \mu, \nu, \tau)$-distinguishability is defined. The path through $\mathcal{P}^*$ defined by $x$ passes
through $\sigma$ stages. Therefore, there must be at least one stage containing a tree path that
produces at least $b = \lceil m/\sigma \rceil$ outputs (a rich path). (As shown in the last paragraph of this
proof, $b \le \lceil \mu m \rceil$ when $\lambda \le \mu$ for sufficiently large $n$.) Thus, $x$ defines at least one rich
path. Let $\alpha \le \tau(b)$. Because the function $f : \mathcal{A}^n \mapsto \mathcal{A}^m$ is
$(\phi, \lambda, \mu, \nu, \tau)$-distinguishable, each rich path can be associated with at most
$|\mathcal{A}|^{n-\alpha-\nu b}$ inputs. (This number is smaller if more than $b$ outputs are produced.) Since
there are at most $2^S$ trees and at most $|\mathcal{A}|^{\alpha}$ paths through each tree, there are at most
$2^S |\mathcal{A}|^{\alpha}$ rich paths. Furthermore, two distinct rich paths (either the inputs queried or
outputs produced are different) are associated with disjoint sets of input $n$-tuples.
Thus, $2^S |\mathcal{A}|^{\alpha} |\mathcal{A}|^{n-\alpha-\nu b}$ cannot be less than the number of input $n$-tuples in question,
from which the following inequality holds:

$$\phi\, |\mathcal{A}|^n \le 2^S\, |\mathcal{A}|^{\alpha}\, |\mathcal{A}|^{n-\alpha-\nu b}$$

We conclude that

$$S \ge \nu b \log_2 |\mathcal{A}| + \log_2 \phi$$

We replace $b = \lceil m/\sigma \rceil$ by its lower bound $m\alpha/2T$. Since $\tau(b)$ is a nondecreasing function,
the value of $\alpha$ satisfying $\alpha \le \tau(b)$ is not increased by replacing $b$ by $m\alpha/2T$. Thus,
$S \ge \nu(m\alpha/2T) \log_2 |\mathcal{A}| + \log_2 \phi$, subject to $\alpha \le \tau(m\alpha/2T)$ and $\alpha \le \lambda n$.

We show there exists an integer $n_0$ such that for $n \ge n_0$ the condition $b \le \lceil \mu m \rceil$
is met by the condition $\lambda \le \mu$. Note that $b = \lceil m/\sigma \rceil$ is a nondecreasing function of
$\alpha$ and a nonincreasing function of $T$ since $\sigma = \lceil (T+1)/(\alpha+1) \rceil$ is a nonincreasing
function of $\alpha$ and a nondecreasing function of $T$. Thus, $b$ is largest when $T = n$ and
$\alpha = \lambda n$. It follows that $b$ is largest when $\sigma = \lceil (n+1)/(\lambda n+1) \rceil \le \lceil 1/\lambda \rceil$. If
$n \ge (\lceil 1/\lambda \rceil - 2)/(1 - \lambda(\lceil 1/\lambda \rceil - 1))$, then $(n+1)/(\lambda n+1) \ge \lceil 1/\lambda \rceil - 1$, which implies that
$\lceil (n+1)/(\lambda n+1) \rceil = \lceil 1/\lambda \rceil$. In other words, when $n \ge (\lceil 1/\lambda \rceil - 2)/(1 - \lambda(\lceil 1/\lambda \rceil - 1))$,
$b$ assumes a value of at most $\lceil m/\lceil 1/\lambda \rceil \rceil \le \lceil \lambda m \rceil$. ■
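The stage-counting step of the proof can be checked numerically. The following Python sketch computes the number of stages $\sigma$ and the rich-path output count $b$ from $T$, $\alpha$, and $m$, and tests the two inequalities the proof relies on, $\sigma \le 2T/\alpha$ and $b \ge m\alpha/2T$, for $1 \le \alpha \le T$. The function names are illustrative and do not come from the text.

```python
def stage_parameters(T, alpha, m):
    """Return (sigma, b) for a T-level program cut into stages of depth alpha.

    sigma = ceil((T + 1) / (alpha + 1)) is the number of stages and
    b = ceil(m / sigma) is the number of outputs some stage must produce.
    """
    sigma = -(-(T + 1) // (alpha + 1))   # integer ceiling division
    b = -(-m // sigma)
    return sigma, b


def bounds_hold(T, alpha, m):
    """Check sigma <= 2T/alpha and b >= m*alpha/(2T) in exact integer arithmetic."""
    sigma, b = stage_parameters(T, alpha, m)
    return sigma * alpha <= 2 * T and 2 * T * b >= m * alpha
```

With $T = 5$ and $\alpha = 2$, the values used for Fig. 10.18, `stage_parameters(5, 2, 10)` gives two stages, so some stage must produce at least five of ten outputs.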

COROLLARY 10.11.1 Let $f : \mathcal{A}^n \mapsto \mathcal{A}^m$ be $(\phi, \lambda, \mu, \nu, \tau)$-distinguishable for $\lambda \le \mu$ and
$\tau(b) = n$. Then the space $S$ and time $T$ required by any normal-form branching program $\mathcal{P}$ that
computes $f$ must satisfy

$$ST \ge \frac{\lambda \nu m n}{2} \log_2 |\mathcal{A}| + T \log_2 \phi$$

when $T \ge n$ and $n \ge (\lceil 1/\lambda \rceil - 2)/(1 - \lambda(\lceil 1/\lambda \rceil - 1))$.

Proof The result follows from the observation that the maximum value of $\alpha$ in
Theorem 10.11.1 is $\lambda n$. ■

The connection between $(\alpha, n, m, p)$-independence and $(1, \lambda, \mu, \nu, \tau)$-distinguishability
is given below.

LEMMA 10.11.1 If $f : \mathcal{A}^n \mapsto \mathcal{A}^m$ is $(1, \lambda, \mu, \nu, \tau)$-distinguishable, it is
$(1/\nu, n, m, p)$-independent for $p = \min(\lambda n, \tau(\mu m)) + \mu m$.

Proof Consider sets of $a$ input and $b$ output variables to $f$ such that $a \le \tau(b)$, $a \le \lambda n$, and
$b \le \mu m$, or equivalently $a \le \tau^*$, where $\tau^* = \min(\lambda n, \tau(\mu m))$ since $\tau(x)$ is nondecreasing
in $x$. For any particular assignment to the $a$ inputs, the input $n$-tuples that agree with this
assignment but lead to different values for the $b$ outputs must be disjoint, as suggested in
Fig. 10.19. We show that for some assignment of values to the $a$ inputs, the number of
values assumed by the $b$ outputs is more than $|\mathcal{A}|^{b/\alpha - 1}$ for $\alpha = 1/\nu$. Suppose not. Then
there are at most $|\mathcal{A}|^{n-a-\nu b} |\mathcal{A}|^{\nu b - 1}$ input tuples for each assignment to the $a$ inputs, or a
total of at most $|\mathcal{A}|^{n-1}$ input tuples. Since $f$ has $|\mathcal{A}|^n$ input tuples, we have a contradiction.
Therefore, $f$ is $(1/\nu, n, m, p)$-independent for $p = \tau^* + \mu m$. ■

The following lemma makes it easier to derive space-time lower bounds for branching
programs. It uses the notions of subfunction (see Definition 2.4.2) and reduction (see
Definition 2.4.1).

Figure 10.19 On the left are the points in the domain of $f$ that map to individual output
$b$-tuples when the values of $a$ input variables are fixed.

LEMMA 10.11.2 Let $g : \mathcal{A}^r \mapsto \mathcal{A}^s$ be a reduction of $f : \mathcal{A}^n \mapsto \mathcal{A}^m$ that is either a subfunction
or a reduction obtained by restricting $f$ to a subset of its domain. A lower bound to the space-time
product $ST$ on branching programs for $g$ is also a lower bound for $f$.

Proof Given any branching program for $f$, we can construct one for $g$ that has no more
vertices or longer paths as follows. If $g$ is obtained by deleting outputs, delete these outputs
from vertices in the branching program. This may allow the coalescing of vertices. If $g$ is
obtained by restricting the set of values that variables of $f$ can assume, this may make some
paths and subgraphs inaccessible and therefore removable. If $g$ is obtained by giving two
variables of $f$ the same identity, this constrains the branching program and again may make
some subgraphs inaccessible. In all cases neither the number of vertices nor the length of
any path to a sink vertex is increased by the reduction of $f$ to $g$. Thus, any lower bound to
$ST$ for $g$ must be a lower bound for $f$. ■

10.12  Properties  of  “nice”  and  “ok”  Matrices* 

In this section we develop properties of matrices that are $\gamma$-nice or $\gamma$-ok, concepts we now
introduce. (A matrix that is $\gamma$-nice is also $\gamma$-ok.) These properties are used in Section 10.13
to develop lower bounds on the exchange of space for time using the Borodin-Cook method.
This section requires a knowledge of probability theory.

DEFINITION 10.12.1 An $n \times m$ matrix $A$, $n \le m$, is $\gamma$-nice for $0 < \gamma < 1/2$ if and only if
for all $p \le \lceil \gamma n \rceil$ and $q \ge n - \lceil \gamma n \rceil$ every $p \times q$ submatrix of $A$ has rank $p$. Such a matrix is
$\gamma$-ok if all such $p \times q$ submatrices have rank at least $\gamma p$.
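For very small matrices the definition can be tested directly by brute force. The Python sketch below assumes the entries are integers modulo a prime `p` (one possible field, chosen here only for illustration) and checks every submatrix with at most $\lceil \gamma n \rceil$ rows and exactly $n - \lceil \gamma n \rceil$ columns; wider submatrices contain one of these, so their rank cannot be smaller. The function names are hypothetical, and the running time is exponential in $n$.

```python
from itertools import combinations
from math import ceil


def rank_mod_p(rows, p):
    """Rank of a matrix (list of lists) over the integers modulo a prime p."""
    m = [[x % p for x in r] for r in rows]
    rank = 0
    ncols = len(m[0]) if m else 0
    for col in range(ncols):
        pivot = next((i for i in range(rank, len(m)) if m[i][col]), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        inv = pow(m[rank][col], -1, p)            # modular inverse (Python 3.8+)
        m[rank] = [(x * inv) % p for x in m[rank]]
        for i in range(len(m)):
            if i != rank and m[i][col]:
                f = m[i][col]
                m[i] = [(x - f * y) % p for x, y in zip(m[i], m[rank])]
        rank += 1
    return rank


def is_gamma_nice(A, gamma, p):
    """Brute-force test of the gamma-nice property for a small matrix A over Z_p."""
    n, m = len(A), len(A[0])
    max_rows = ceil(gamma * n)
    min_cols = n - max_rows
    for num_rows in range(1, max_rows + 1):
        for row_set in combinations(range(n), num_rows):
            for col_set in combinations(range(m), min_cols):
                sub = [[A[i][j] for j in col_set] for i in row_set]
                if rank_mod_p(sub, p) < num_rows:
                    return False
    return True
```

For example, with $n = 4$ and $\gamma = 1/4$ the test requires every row to remain non-zero after any one column is deleted, so the $4 \times 4$ identity matrix fails it.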

As shown below, most matrices are $\gamma$-nice, a fact that is used in several places.

LEMMA 10.12.1 At least a fraction $(1 - (2/3)^{\gamma n})$ of the $|\mathcal{A}|^{n^2}$ $n \times n$ matrices over a
subset $\mathcal{A}$ of a field, $|\mathcal{A}| \ge 2$, are $\gamma$-nice for some constant $\gamma$, $0 < \gamma < 1/2$, independent of $n$ and $\mathcal{A}$.
This result also holds for $n \times n$ Toeplitz matrices, matrices $[t_{i,j}]$ with the property that
$t_{i,j} = t_{i+1,j+1}$, that is, all elements on each diagonal are the same.

Proof Let $r = \lceil \gamma n \rceil$ and $s = n - r$. The proof is established by deriving upper bounds on
the number $N(r, s)$ of $r \times s$ submatrices of an $n \times n$ matrix $M$ and the probability $q(r, s)$ that
any particular $r \times s$ submatrix fails to contain a non-singular $r \times r$ submatrix (it fails to have
rank $r$) when each entry in $M$ is equally likely to be an element of $\mathcal{A}$. Since the probability
of a union of events is at most the sum of the probabilities of the events, the probability that
some $r \times s$ submatrix fails to have rank $r$ is at most $q(r, s) N(r, s)$.

It is straightforward to show that

$$N(r, s) = \binom{n}{r} \binom{n}{s}$$

since an $r \times s$ submatrix of an $n \times n$ matrix is chosen by selecting a set of $r$ rows and
a set of $s$ columns and each can be chosen in $\binom{n}{r}$ ways. (Note that $\binom{n}{r} = \binom{n}{s}$.) We
now show that the binomial coefficient $\binom{n}{r}$ is at most $(n/r)^r e^r$. We use the fact that
$n!/(n-r)! = n(n-1) \cdots (n-r+1) \le n^r$ and the observation that $r^r/r!$ is a term in
the Taylor-series expansion of $e^r$, as stated below:

$$\frac{n!}{r!\,(n-r)!} \le \frac{n^r}{r!} = \left(\frac{n}{r}\right)^r \frac{r^r}{r!} \le \left(\frac{n}{r}\right)^r e^r$$

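Later we show that $q(r, s) \le \rho^{-s} |\mathcal{A}|^{r-1}$, where $\rho = |\mathcal{A}|^2/(2|\mathcal{A}| - 1) \le 2|\mathcal{A}|/3$, from
which it follows that

$$q(r, s) N(r, s) \le |\mathcal{A}|^{-1} \left(\frac{ne}{r}\right)^{2r} \rho^{-s} |\mathcal{A}|^{r} \le |\mathcal{A}|^{-1} \left(\frac{2}{3}\right)^{r} \left[ \rho^{-n} \left(\frac{en|\mathcal{A}|}{r}\right)^{2r} \right]$$

since $s = n - r$. Elementary calculus shows that $(e|\mathcal{A}|/r)^{2r}$ is an increasing function of
$r$ and that it has value 1 at $r = 0$. Since $r = \lceil \gamma n \rceil$ and $\rho \ge 4/3$, it follows that the
quantity in square brackets is less than 1 for some value of $0 < \gamma < 1/2$, which is the
desired conclusion.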

We now give a proof by induction that $q(r, s)$ satisfies $q(r, s) \le \rho^{-s} |\mathcal{A}|^{r-1}$. Clearly
$q(1, 1) \le 1/|\mathcal{A}|$, since at most one entry in $\mathcal{A}$ is zero. This satisfies the bound. We now
assume the inductive hypothesis holds for $q(r-1, s-1)$ and $q(r, s-1)$ and show that it
holds for $q(r, s)$.

Consider an $r \times s$ matrix $B$. It has rank $r$ if the submatrix consisting of the first $s - 1$
columns has rank $r$. (This occurs with probability $1 - q(r, s-1)$.) If this is not the case,
there are many other ways in which it can have rank $r$. In particular, this is true if the
submatrix $C$ consisting of the last $r - 1$ rows and the first $s - 1$ columns of $B$ has rank
$r - 1$ (with probability $1 - q(r-1, s-1)$) and the element $b_{1,s}$ has an appropriate value
(with probability at least $1 - 1/|\mathcal{A}|$), as we now show.

Consider a submatrix $D$ consisting of some $r - 1$ linearly independent columns of $C$.
Consider the $r \times r$ submatrix of $B$ consisting of these same $r - 1$ columns and its last
column. When the determinant of this matrix is expanded on the first row, the multiplier of
$b_{1,s}$ is $\pm 1$ times the determinant of $D$, which is non-zero. Thus, there is at most one value
for $b_{1,s}$ that causes the determinant to be zero (the field element causing it to be zero may
not be in the set $\mathcal{A}$) or at least $|\mathcal{A}| - 1$ values that cause it to be non-zero. Summarizing this
result, we have the following lower bound:


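$$\begin{aligned}
1 - q(r, s) &\ge 1 - q(r, s-1) + q(r, s-1)\,\bigl(1 - q(r-1, s-1)\bigr)\left(1 - \frac{1}{|\mathcal{A}|}\right) \\
&\ge \bigl(1 - q(r, s-1)\bigr)\frac{1}{|\mathcal{A}|} + \bigl(1 - q(r-1, s-1)\bigr)\left(1 - \frac{1}{|\mathcal{A}|}\right)
\end{aligned}$$

This implies that

$$\begin{aligned}
q(r, s) &\le q(r, s-1)\,\frac{1}{|\mathcal{A}|} + q(r-1, s-1)\left(1 - \frac{1}{|\mathcal{A}|}\right) \\
&\le \rho^{-s} |\mathcal{A}|^{r-1}\, \frac{\rho}{|\mathcal{A}|}\left(2 - \frac{1}{|\mathcal{A}|}\right) \\
&\le \rho^{-s} |\mathcal{A}|^{r-1}
\end{aligned}$$

which is the desired conclusion.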

The  proof  also  holds  for  Toeplitz  matrices  (each  element  on  a  diagonal  of  the  matrix 
is  the  same)  because  we  reasoned  only  about  the  value  of  elements  in  the  upper  right-hand 
corner  of  submatrices  that  are  on  different  diagonals.  ■ 

The  Kronecker  product  of  matrices  is  used  in  Section  10.13.5  to  derive  a  lower  bound  on 
the  space-time  product  for  matrix  inversion. 

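DEFINITION 10.12.2 The Kronecker product of two $n \times n$ matrices $A$ and $B$ is the $n^2 \times n^2$
matrix $C$, denoted $C = A \otimes B$, obtained by replacing the entry $a_{i,j}$ of $A$ with the matrix $a_{i,j} B$.

A Kronecker product $C = A \otimes B$ of matrices $A$ and $B$ is shown below:

$$A = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}, \quad
B = \begin{bmatrix} 5 & 6 \\ 7 & 8 \end{bmatrix}, \quad
C = \begin{bmatrix} 5 & 6 & 10 & 12 \\ 7 & 8 & 14 & 16 \\ 15 & 18 & 20 & 24 \\ 21 & 24 & 28 & 32 \end{bmatrix}$$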
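The definition translates directly into code. The following Python sketch builds $A \otimes B$ from nested lists (the function name `kronecker` is illustrative) and reproduces the $4 \times 4$ example above.

```python
def kronecker(A, B):
    """Kronecker product of two matrices given as lists of rows.

    Each entry a of A is replaced by the block a * B, so row (i, k) of the
    result is built from row i of A and row k of B.
    """
    return [[a * b for a in row_a for b in row_b]
            for row_a in A
            for row_b in B]


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = kronecker(A, B)
```

Row `n*i + k` and column `n*j + l` of `C` hold `A[i][j] * B[k][l]`, which is the indexing convention used when rows and columns are numbered from 0.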

The following property of the Kronecker product of two $\gamma$-nice matrices is used to derive
the space-time lower bounds stated in Theorem 10.13.5.

LEMMA 10.12.2 If $A$ and $B$ are both $n \times n$ $\gamma$-nice matrices for some $0 < \gamma < 1/2$, then
$C = A \otimes B$ is an $n^2 \times n^2$ $\gamma^2$-ok matrix.

Proof Number the rows and columns of $A$, $B$, and $C$ consecutively from 0. For a matrix
$E$, extend the notation $e_{i,j}$ for the entry in the $i$th row and $j$th column of $E$ to $e_{I,J}$, by
which we denote the submatrix of $E$ consisting of the intersection of the rows in the set $I$
and columns in the set $J$. Thus, if $I = \{i\}$ and $J = \{j\}$, then $e_{I,J} = e_{i,j}$.

To show that $C$ is $\gamma^2$-ok, we must show that every $p \times q$ submatrix $S$ of $C$ satisfying
$p \le \lceil \gamma^2 n^2 \rceil$ and $q \ge n^2 - \lceil \gamma^2 n^2 \rceil$ has rank at least $\gamma^2 p$. Such a matrix $S$ can be represented
as $S = c_{I,J}$ for index sets $I$ and $J$, where $p = |I| \le \lceil \gamma^2 n^2 \rceil$ and $q = |J| \ge n^2 - \lceil \gamma^2 n^2 \rceil$.
We assume that $\gamma n \ge 1$, since otherwise the result holds trivially.

The $r$th block row of $C$ is the submatrix $[a_{r,0} B, a_{r,1} B, \ldots, a_{r,n-1} B]$ containing rows
numbered $I_r = \{rn, rn+1, \ldots, rn+n-1\}$ and all $n^2$ columns.

Let $\Delta_r = I \cap \{rn, rn+1, \ldots, rn+n-1\}$ be the indices of the rows of $S$ that fall into
the $r$th block row. Choose a set $\Gamma \subseteq \{0, 1, 2, \ldots, n-1\}$ of size $|\Gamma| = \lceil \gamma n \rceil$ that maximizes
the sum $T = \sum_{r \in \Gamma} |\Delta_r|$. Then, $T \ge \gamma p$ because the lower bound is achieved if the rows
of $S$ are uniformly distributed over the rows of $C$ and $T$ is larger if they are not.

Let $\hat{\Delta}_r = \Delta_r$ if $|\Delta_r| \le \lceil \gamma n \rceil$ and let $\hat{\Delta}_r$ consist of the smallest $\lceil \gamma n \rceil$ indices in $\Delta_r$
otherwise. Clearly, $|\hat{\Delta}_r| \ge \gamma |\Delta_r|$ because $\Delta_r$ is chosen from a set of size $n$. Call rows of $C$
with indices in $\bigcup_{r \in \Gamma} \hat{\Delta}_r$ blue rows. There must be at least $\gamma^2 p$ blue rows because, if not,

$$\gamma^2 p > \sum_{r \in \Gamma} |\hat{\Delta}_r| \ge \gamma \sum_{r \in \Gamma} |\Delta_r| = \gamma T \ge \gamma^2 p$$

which is a contradiction.

We now show that the blue rows of $S$ are linearly independent. Suppose not. Then
there exist constants $\{\alpha_{r,s} \mid r \in \Gamma, s \in \hat{\Delta}_r\}$ not all of which are zero such that the linear
combination of the blue rows of $S$ is zero:

$$\sum_{r \in \Gamma} \sum_{s \in \hat{\Delta}_r} \alpha_{r,s}\, c_{nr+s,J} = 0 \qquad (10.7)$$

Here 0 is a column vector of zeros, one per blue row. Again, $J$ is the set of columns of $C$ in
the submatrix $S$.

Column $j$ of the $n \times n$ matrix $B$ is good if it is associated with at least $(1 - \gamma)n$ columns
of $S$ and is bad otherwise. Let $G$ be the indices of the good columns in $B$ and let $g = |G|$.
Then there are $g \ge (1 - \gamma)n$ good columns and $b \le \gamma n$ bad columns in $B$ ($g + b = n$)
because, if not, $g \le (1 - \gamma)n - 1$ and the number of columns altogether in $S$ is at most
$gn + b(1 - \gamma)n$, which is an increasing function of $g$ whose value is less than $n^2 - \lceil \gamma^2 n^2 \rceil$
when $g \le (1 - \gamma)n - 1$, which is less than the number of columns of $S$.

Since $B$ has at least $g = |G| \ge (1 - \gamma)n$ good columns and $B$ is $\gamma$-nice, any set of up to
$\lceil \gamma n \rceil$ rows are linearly independent. In particular, the rows of $B$ indexed by $\hat{\Delta}_r$ are linearly
independent. This implies that

$$\sum_{s \in \hat{\Delta}_r} \alpha_{r,s}\, b_{s,G} \ne 0$$

where 0 is a zero column with $|\hat{\Delta}_r|$ rows. Thus, there must be a column index $t \in G$ such
that

$$\sum_{s \in \hat{\Delta}_r} \alpha_{r,s}\, b_{s,t} \ne 0 \qquad (10.8)$$

Let $K = \{j \mid nj + t \in J\}$ be the columns of $S$ corresponding to the good column of $B$
with index $t$. It follows that $|K| \ge \lceil (1 - \gamma)n \rceil$.

Let $u_i$ be the intersection of the $i$th row of $S$ with columns whose indices are in
$K$. Similarly, let $v_i$ be the intersection of the $i$th row of $A$ with columns in $K$. It follows
from the definition of $C$ that $u_{ni+j} = b_{j,t} v_i$. From (10.7) we have that

$$\sum_{r \in \Gamma} \sum_{s \in \hat{\Delta}_r} \alpha_{r,s}\, c_{nr+s,\,J \cap K} = \sum_{r \in \Gamma} \left[ \sum_{s \in \hat{\Delta}_r} \alpha_{r,s}\, b_{s,t} \right] v_r = 0$$

However, the $|\Gamma|$ rows $v_r$ constitute a $\lceil \gamma n \rceil \times |K|$ submatrix of the $\gamma$-nice matrix $A$
where $|K| \ge \lfloor (1 - \gamma)n \rfloor$. Since its rows are linearly independent, each of the coefficients
$\sum_{s \in \hat{\Delta}_r} \alpha_{r,s} b_{s,t}$ must be zero, contradicting the statement of (10.8). It follows that
$C = A \otimes B$ is $\gamma^2$-ok. ■


10.13  Applications  of  the  Borodin-Cook  Method 


In this section we illustrate the Borodin-Cook method of Section 10.11 by applying it to a
variety of representative problems.

10.13.1 Convolution

The wrapped convolution function $f^{(n)}_{\text{wrapped}} : \mathcal{R}^{2n} \mapsto \mathcal{R}^n$ over the ring $\mathcal{R}$ (see Problem 6.19)
of two sequences $u$ and $v$ is described by the matrix-vector product $Cv$ of a circulant matrix
$C$ in which $c_{i,j} = u_{(i-j) \bmod n}$, as shown in Section 10.5.1.
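As a concrete illustration, the Python sketch below forms the circulant matrix with entries $c_{i,j} = u_{(i-j) \bmod n}$ and computes the wrapped convolution as the product $Cv$. It assumes, for illustration only, that the ring is the integers modulo `modulus`; the function names are hypothetical.

```python
def circulant(u):
    """Return the n x n circulant matrix with entry (i, j) equal to u[(i - j) mod n]."""
    n = len(u)
    return [[u[(i - j) % n] for j in range(n)] for i in range(n)]


def wrapped_convolution(u, v, modulus):
    """Wrapped convolution of u and v over Z_modulus, computed as the product C v."""
    C = circulant(u)
    return [sum(c * x for c, x in zip(row, v)) % modulus for row in C]
```

For even $n$, the upper-left $n/2 \times n/2$ block of `circulant(u)` is constant along each diagonal and its $n - 1$ diagonals hold distinct entries of `u`; for `u = [1, 2, 3, 4]` that block is `[[1, 4], [2, 1]]`.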

LEMMA  I  0. 1  3. 1  For  n  even,  the  wrapped  convolution  f^lpped  ■  TZ2n  t— ►  7 Zn  over  the  ringlZ 
contains  a  subfunction  :  !Zln  i— >  lZn! 2  that  is  (1,7/2, 7/2,  l,2n) -distinguishable  for  some 
0  <  7  <  1/2. 

Proof Writing $C$ as a $2 \times 2$ matrix of $n/2 \times n/2$ matrices, we find that its (1,1) entry is
an unrestricted Toeplitz matrix $T$. That is, each diagonal can contain a different element.
Consider the subfunction of $f^{(n)}_{\text{wrapped}}$ defined by this submatrix. By Lemma 10.12.1, a
fraction of at least $1 - (2/3)^{(\gamma/2)n}/|\mathcal{R}|$ of such matrices are $\gamma$-nice. By Definition 10.12.1,
this implies that $\lceil (\gamma/2)n \rceil$ output variables assume $|\mathcal{R}|^{\lceil (\gamma/2)n \rceil}$ different values. If we fix
the entries of $T$ to be those of a $\gamma$-nice matrix, by Lemma 10.11.2 the lower bound on $ST$
for matrix-vector multiplication with a Toeplitz matrix with $n$ replaced by $n/2$ serves as a
lower bound for the original problem. Since for large $n$ most Toeplitz matrices are $\gamma$-nice,
we have the desired conclusion. ■

Invoking Theorem 10.11.1, we have the space-time lower bound stated below. The upper
bound follows from the design of a branching program to implement the inner product
operation, as suggested by Fig. 10.6.

THEOREM 10.13.1 There is an integer $n_0 > 0$ such that for $n$ even and $n \ge n_0$, the time $T$ and
space $S$ used by any general branching program for the wrapped convolution
$f^{(n)}_{\text{wrapped}} : \mathcal{R}^{2n} \mapsto \mathcal{R}^n$ over the ring $\mathcal{R}$ must satisfy

$$ST = \Omega(n^2 \log |\mathcal{R}|) \qquad (10.9)$$

Branching programs exist that achieve the following bound for $\log |\mathcal{R}| \le S \le n \log |\mathcal{R}|$:

$$ST = O(n^2 \log n \log |\mathcal{R}|)$$

Proof Since the wrapped convolution function depends on $2n$ variables, it can be computed
via table lookup with space $O(n \log |\mathcal{R}|)$ and time $O(n)$.

At the limit of small space, namely for $S = \Theta(\log |\mathcal{R}|)$, a branching program can
be designed that computes the $n$ inner products defined by the matrix-vector product of
(10.1). An example of a branching program to compute the inner product of two 3-vectors
is shown in Fig. 10.20. A branching program for the inner product of two $n$-tuples can be
constructed that has $O(n|\mathcal{R}|^2)$ vertices and depth $O(n)$. Hence, a branching program to
multiply a general $n \times n$ matrix by a vector can be constructed that has time $O(n^2)$ and
space $O(\log n + \log |\mathcal{R}|)$.

Figure 10.20 A branching program to compute the inner product of two 3-vectors over the set
$\mathcal{R}$ of integers modulo 2.

To fill in the range between these extremes, let $k$ divide $n$ and note that the product of
an $n \times n$ matrix by a column $n$-vector can be viewed as the product of an $n/k \times n/k$ matrix
of $k \times k$ matrices with a column $n/k$-vector of column $k$-vectors. Since each product of
a $k \times k$ submatrix by a $k$-vector is a function of $O(k)$ parameters, compute it with table
lookup in time $O(k)$ and space $O(k \log |\mathcal{R}|)$. Add two of these matrix-vector products by
rooting a table-lookup program at each of the $O(|\mathcal{R}|^k)$ final states of a first table-lookup
program. Coalesce final states corresponding to the $|\mathcal{R}|^k$ sums of the two column $k$-vectors.
This program has $O(|\mathcal{R}|^{2k})$ vertices or space $O(k \log |\mathcal{R}|)$ and time $O(k)$. Using $n/k$ such stages
increases the number of vertices and time each by a factor of $n/k$. Since this process is
then repeated for each of the $n/k$ rows of the block matrix, the space and time used are
$O(k \log |\mathcal{R}| + \log(n/k))$ and $O(n^2/k)$, respectively. ■
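To see how the block size $k$ moves the hybrid program between the two extremes, the following Python sketch evaluates the space and time expressions from the proof with all hidden constants set to 1. That normalization is an assumption made purely for illustration, as is the function name.

```python
from math import log2


def hybrid_space_time(n, k, ring_size):
    """Space (in bits) and time of the blocked program, constants taken as 1.

    space = k * log2|R| + log2(n / k),  time = n^2 / k,  for k dividing n.
    """
    if n % k != 0:
        raise ValueError("k must divide n")
    space = k * log2(ring_size) + log2(n // k)
    time = n * n // k
    return space, time
```

Setting `k = 1` recovers the small-space extreme (space about $\log n + \log |\mathcal{R}|$, time $n^2$), while `k = n` recovers table lookup (space about $n \log |\mathcal{R}|$, time $n$).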

10.13.2  Integer  Multiplication 

To derive space-time lower bounds for integer multiplication, we could invoke the reductions
from this problem to cyclic shifting, as was done in Section 10.5.3. However, as shown in
Section 10.10, the space-time product for cyclic shifting is only $O(n \log n)$. Thus, we are
forced to use another reduction to obtain a strong space-time product lower bound, namely a
reduction from integer multiplication to convolution.

Let S) be the ring of integers modulo 2. As shown in Problem 6.20, the integer
multiplication function $f^{(n)}_{\text{mult}} : \mathcal{B}^{2n} \mapsto \mathcal{B}^{2n}$ contains the convolution function over fconv°S :
Z,,!/i°g" i— >■ TLf '/los” . Thus, by Lemmas 10.11.2 and 10.13.1 the following holds:


THEOREM 10.13.2 There is an integer $n_0 > 0$ such that for $n \ge n_0$ the time $T$ and space $S$
used by any general branching program for binary integer multiplication
$f^{(n)}_{\text{mult}} : \mathcal{B}^{2n} \mapsto \mathcal{B}^{2n}$ must satisfy

$$ST = \Omega(n^2/\log^2 n) \qquad (10.10)$$

This lower bound can be achieved to within a factor of $O(\log^3 n)$ for space $\Omega(\log n) \le S \le O(n)$.


Proof Since the integer multiplication function depends on $2n$ variables, it can be
computed via table lookup with space $O(n)$ and time $O(n)$, thereby meeting the lower bound
to within a factor of $O(\log^2 n)$.

At the limit of small space, $S = \Theta(\log n)$, the integer multiplication algorithm of
Section 10.5.3 provides a branching program. Since at most $\lceil \log_2 n \rceil$ bits suffice for the
carry from one power of 2 to the next, a branching program based on this algorithm has
at most $O(2^{\lceil \log_2 n \rceil})$ vertices at each of $n^2$ levels. Thus, this program uses time $O(n^2)$ and
space $O(\log n)$, achieving the lower bound to within a factor of $O(\log^3 n)$.

We sketch a procedure to fill in the range of space between these extremes and ask the
reader to complete the details. (See Problem 10.39.) Assume that $k$ divides $n$ and represent
each $n$-bit binary number as an $(n/k)$-component base-$2^k$ number. As in the standard
binary integer multiplication algorithm (where $k = 1$), form $n/k$ $(n/k)$-component numbers
through multiplication and shifting of consecutive base-$2^k$ components, as suggested below:


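$$\begin{array}{ccccccc}
 & & & v_3 u_0 & v_2 u_0 & v_1 u_0 & v_0 u_0 \\
 & & v_3 u_1 & v_2 u_1 & v_1 u_1 & v_0 u_1 & 0 \\
 & v_3 u_2 & v_2 u_2 & v_1 u_2 & v_0 u_2 & 0 & 0 \\
v_3 u_3 & v_2 u_3 & v_1 u_3 & v_0 u_3 & 0 & 0 & 0
\end{array}$$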

Here $u_r$ and $v_s$ are base-$2^k$ numbers. Multiply two such numbers through table lookup in
time and space $O(k)$. Extend the algorithm for the base-2 case by replacing each
subprogram that multiplies two binary numbers by the table lookup program to multiply base-$2^k$
numbers. This new program adds products to a running sum of length $O(\log n)$ bits. Thus,
it uses space $O(k + \log n)$ and time $O(n^2/k)$, giving a space-time product of $O(n^2 \log n)$
for $k \ge \log n$. ■
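The base-$2^k$ scheme can be sketched in ordinary code. The Python function below splits two $n$-bit integers into $n/k$ digits of $k$ bits each, multiplies digit pairs through a precomputed lookup table, and adds the shifted partial products. It illustrates the arithmetic only and does not model the branching program's space usage; the function name and the explicit table are illustrative choices, not taken from the text.

```python
def multiply_base_2k(u, v, n, k):
    """Multiply two n-bit non-negative integers using k-bit digits (k divides n)."""
    if n % k != 0:
        raise ValueError("k must divide n")
    base = 1 << k
    count = n // k
    # Lookup table for the product of two base-2^k digits.
    table = [[a * b for b in range(base)] for a in range(base)]
    u_digits = [(u >> (k * i)) & (base - 1) for i in range(count)]
    v_digits = [(v >> (k * i)) & (base - 1) for i in range(count)]
    total = 0
    for r in range(count):
        for s in range(count):
            # The product u_r * v_s contributes at digit position r + s.
            total += table[u_digits[r]][v_digits[s]] << (k * (r + s))
    return total
```

With `k = 1` this reduces to the standard binary shift-and-add algorithm; larger `k` trades a bigger table for fewer digit products, $(n/k)^2$ instead of $n^2$.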

10.13.3 Matrix-Vector Product

The matrix-vector product function $f^{(n)}_{Ax} : \mathcal{R}^n \mapsto \mathcal{R}^n$ computes the $n$-tuple $y$ from the
$n$-tuple $x$ for a fixed $n \times n$ matrix $A$ over $\mathcal{R}$ according to the rule

$$y = Ax$$

where $y_j = \sum_{k=0}^{n-1} a_{j,k} x_k$ for $0 \le j \le n - 1$.

LEMMA 10.13.2 Let $A$ be a $\gamma$-ok $n \times n$ matrix over $\mathcal{R}$ for some $0 < \gamma < 1/2$. Then the
matrix-vector product function $f^{(n)}_{Ax} : \mathcal{R}^n \mapsto \mathcal{R}^n$ is $(1, \gamma, \gamma, \gamma, \tau)$-distinguishable where $\tau(b) = n$.


Proof To show that $f^{(n)}_{Ax}$ is $(1, \gamma, \gamma, \gamma, \tau)$-distinguishable, select any $a \le \lceil \gamma n \rceil$ inputs and
any $b \le \lceil \gamma n \rceil$ outputs. If the $i$th input is chosen and it has value $u_i$, introduce the equation
$x_i = u_i$. Let $B$ be the $a \times n$ coefficient matrix defining these equations; that is, $Bx = u$,
where $B$ contains the $j$th row of the $n \times n$ identity matrix if the $j$th variable is among the
selected inputs.

Consider the $(n + a) \times n$ matrix $C = \begin{bmatrix} A \\ B \end{bmatrix}$. We show that it has rank $a + \gamma b$. The
submatrix $D$ of $A$ consisting of the intersection of those columns not selected by inputs (of
which there are $n - a \ge n - \lceil \gamma n \rceil$) and rows selected by outputs (of which there are $b$)
has rank $\gamma b$ because $A$ is $\gamma$-ok. Thus, $\gamma b$ of the $n - a$ columns of $A$ not selected by inputs
and the $a$ non-zero columns of $B$ are linearly independent. Thus, the submatrix $E$ of $C$
consisting of the selected rows of $B$ and the rows of $D$ has rank $a + \gamma b$.

The number of $n$-tuple input vectors $x$ consistent with the linear system $Ex = d$ is
$|\mathcal{A}|^{n-a-\gamma b}$, as we show. Without loss of generality assume that the first $a + \gamma b$ columns of $E$
(call it $F$) are linearly independent. (Permute the columns, if necessary, so that this is true.)
Fix the values of the $b$ realizable outputs. Then for each assignment to inputs corresponding
to the last $n - (a + \gamma b)$ columns there are unique values for the first $a + \gamma b$ inputs, due to
the non-singularity of $F$. Thus the number of assignments to the last $n - (a + \gamma b)$ columns
that are consistent with values for the $a$ inputs and $b$ outputs is $|\mathcal{A}|^{n-a-\gamma b}$. ■


Invoking Corollary 10.11.1 yields the following result.

THEOREM 10.13.3 Let $A$ be a $\gamma$-ok $n \times n$ matrix over $\mathcal{R}$ for some $0 < \gamma < 1/2$. Then there
is a constant $0 < \gamma < 1/2$ and an integer $n_0$ such that for $n \ge n_0$ the space $S$ and time $T$ used
by any general branching program for the function $f^{(n)}_{Ax} : \mathcal{R}^n \mapsto \mathcal{R}^n$ must satisfy the following
lower bound when $T \ge n$:

$$ST = \Omega(n^2 \log |\mathcal{R}|)$$

This lower bound can be met to within a factor of $O(\log n)$ for $\log n \le S \le n$.

Proof The lower bound follows from the application of Theorem 10.11.1.

The matrix-vector product $Ax$ for an $n \times n$ matrix $A$ can be done with a branching
program for the standard algorithm as follows: Compute the inner product of the $i$th row
with the column $x$ for $1 \le i \le n$. The inner product of two $n$-tuples can be computed
with a branching program having $O(n|\mathcal{R}|^2)$ vertices, as suggested in Fig. 10.20. (This is
true even if $A$ is not fixed.) The $n$ branching programs for inner products can be concatenated to
form one branching program to multiply an $n \times n$ matrix with an $n$-vector. This branching
program uses space $O(\log n + \log |\mathcal{R}|)$ and time $O(n^2)$, thereby meeting the lower bound
to within a factor of $O(\log n)$.
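One way to picture the $O(n|\mathcal{R}|^2)$ vertex count is to label each vertex by the variable it queries together with what the program must remember at that point: the running sum, and additionally $x_i$ while waiting for $y_i$. The Python sketch below follows this labeling for the integers modulo `q`. It is an assumed encoding chosen for illustration, not the construction of Fig. 10.20, and the function names are hypothetical.

```python
from itertools import product


def run_inner_product_bp(x, y, q):
    """Follow the path taken on inputs x, y over the integers mod q.

    Returns (inner product mod q, list of vertices visited). Each vertex is a
    tuple naming the queried variable and the information carried so far.
    """
    acc = 0
    visited = []
    for i in range(len(x)):
        visited.append(("x", i, acc))        # queries x_i, remembers the sum
        xi = x[i] % q
        visited.append(("y", i, acc, xi))    # queries y_i, remembers sum and x_i
        acc = (acc + xi * y[i]) % q
    return acc, visited


def reachable_vertices(n, q):
    """Collect every vertex visited on some input pair (brute force, small n only)."""
    seen = set()
    for x in product(range(q), repeat=n):
        for y in product(range(q), repeat=n):
            seen.update(run_inner_product_bp(x, y, q)[1])
    return seen
```

Under this labeling there are at most $n(q + q^2)$ vertices and every path makes $2n$ queries, so the space, the logarithm of the vertex count, is $O(\log n + \log q)$.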

A matrix-vector product for a fixed matrix (this case) can also be computed by table
lookup in space $O(n \log |\mathcal{R}|)$ and time $O(n)$ since this function has $n$ variables.

To bridge the gap between these two results, compute the matrix-vector product using a
hybrid algorithm similar to that used for convolution in the proof of Theorem 10.13.1. ■

10.13.4 Matrix Multiplication*

The space-time lower-bound argument for matrix multiplication in the branching program
model uses ideas similar to those used for matrix-vector multiplication.

LEMMA 10.13.3 The matrix multiplication function $f^{(n)}_{A \times B} : \mathcal{R}^{2n^2} \mapsto \mathcal{R}^{n^2}$ over the ring $\mathcal{R}$ is
$(1, 1, 1, \gamma/4, \tau)$-distinguishable for some $0 < \gamma < 1/2$, where $\tau(b) = \gamma n \sqrt{b/2}$.

Proof Consider the subfunction of $f^{(n)}_{A \times B}$ obtained by choosing $A$ and $B$ from the set of
$n \times n$ $\gamma$-nice matrices. By Lemma 10.11.2, a lower bound on the space-time product for
this subfunction provides a lower bound to the matrix multiplication function.

Consider some $a \le 2n^2$ selected inputs and some $b \le n^2$ selected outputs such that
$a \le \tau(b)$; that is, $(a/\gamma n)^2 \le b/2$. The outputs correspond to entries of the product matrix
$C = A \times B$. Let row $i$ of $C$ be a heavy row if at least $\gamma n$ of the $a$ selected inputs are in
row $i$ of $A$. Similarly, let column $j$ of $C$ be a heavy column if at least $\gamma n$ of the $a$ selected
inputs are in column $j$ of $B$. A row or column of $C$ is light otherwise. (See Fig. 10.21.)

There are at most $a/\gamma n$ heavy rows and $a/\gamma n$ heavy columns of $C$. We now show that
either a) at least $b/4$ of the selected outputs fall into light rows of $C$ or b) at least $b/4$ of
the selected outputs fall into light columns of $C$. Suppose not. Then both statements are
false and less than $b/4$ of the selected outputs fall into light rows and less than $b/4$ of the
selected outputs fall into light columns of $C$. It follows that at least $3b/4$ of the selected
outputs fall into heavy rows. Of these at most $(a/\gamma n)^2$ fall into heavy columns, since this is
the maximum number of entries of $C$ that could be in both heavy rows and columns. The
remaining selected outputs in these rows (of which there are less than $b/4$) fall into light
columns. However, because the entries in each row fall into either heavy or light columns,
the number of selected outputs that are in heavy rows is less than $(a/\gamma n)^2 + b/4$. But this
is less than $3b/4$ since $a \le \tau(b) = \gamma n \sqrt{b/2}$, contradicting the stated hypothesis.

Figure 10.21 Identification of heavy rows and columns of matrices.

Without loss of generality, assume that b holds. (If not, a holds and at least $b/4$ selected
outputs fall into light rows of $C$ or into light columns of the transpose $C^T$.) Represent the
product $C = A \times B$ as follows:

$$\begin{bmatrix} A & & & \\ & A & & \\ & & \ddots & \\ & & & A \end{bmatrix}
\begin{bmatrix} B^1 \\ B^2 \\ \vdots \\ B^n \end{bmatrix} =
\begin{bmatrix} C^1 \\ C^2 \\ \vdots \\ C^n \end{bmatrix}$$

Here  Bl  and  Cl  are  the  ith  columns  of  the  matrices  B  and  C,  respectively.  Let  B  and 
C  denote  the  columns  of  these  columns,  respectively,  and  let  D  denote  the  block  diagonal 
matrix  on  the  left. 

We show that at most $|\mathcal{R}|^{2n^2 - a - \gamma b/4}$ of the matrix pairs $(A, B)$ are consistent with any assignment to any set of $a$ selected inputs and values of any $b$ selected outputs.

Of the $a$ selected inputs, let $a_1$ be drawn from $A$ and $a_2$ be drawn from $B$, where $a = a_1 + a_2$. The number of $\gamma$-nice matrices $A$ consistent with the $a_1$ selected inputs from $A$ is at most $|\mathcal{R}|^{n^2 - a_1}$. We now bound the number of matrices $B$ that are consistent with the values of selected inputs and outputs.

Let $A$ be fixed and $\gamma$-nice. Consider just the (at least $b/4$) selected outputs that fall into light columns of $C$. Every value for the stacked column vector $\mathbf{B}$ consistent with the selected inputs and these outputs must satisfy the following linear equation:

$$
\begin{bmatrix} E \\ F \end{bmatrix} \mathbf{B} = H\mathbf{B} = \begin{bmatrix} r \\ c \end{bmatrix}
$$

Here $E$ consists of the $b$ rows of $D$ corresponding to selected outputs and $F$ is a submatrix of the $n^2 \times n^2$ identity matrix consisting of the $a_2$ rows corresponding to selected inputs in $B$. $c$ is the column of values for the selected inputs in $B$ and $r$ is a column of selected outputs of $C$ that fall into light columns. The number of values for $\mathbf{B}$ consistent with a fixed $A$ and the values of the selected inputs and outputs is no more than the number of solutions $\mathbf{B}$ to these equations, since we are ignoring outputs in heavy rows.

We now show that $H$ has rank at least $a_2 + \gamma b/4$. A column of $H$ is queried if a column of $E$ contains a selected input or the corresponding row of $\mathbf{B}$ contains a selected input. $a_2$ of these columns correspond to selected inputs in $B$ and are linearly independent because the corresponding columns of $F$ are linearly independent. Consider the unqueried columns of $H$. These columns in $F$ are zero columns. Thus, consider these unqueried columns in $E$. Consider $k$ rows in $E$ that come from a common copy of $A$ on the diagonal of $D$. The column $B^i$ of $B$ corresponding to this copy of $A$ is light (it has fewer than $\gamma n$ selected entries) because the corresponding column of $C$ is chosen to be light. Thus, this copy of $A$ has at least $n(1 - \gamma)$ unqueried entries, or at least $n(1 - \gamma)$ of its columns are unqueried.

Since $A$ is $\gamma$-nice, the unqueried columns of this copy of it have rank at least $\min(k, \gamma n)$. Because there are no dependencies between columns in distinct copies of $A$ in $D$, the number of linearly independent unqueried columns of $E$ is minimal if they all fall in as few common copies of $A$ as possible, because then $\min(k, \gamma n) = \gamma n$. It follows that the unqueried columns of $E$ have rank at least $\gamma b/4$. Since the queried columns have rank at least $a_2$, the columns of $H$ have rank at least $a_2 + \gamma b/4$. It follows from an argument given in the proof of Lemma 10.13.2 that the number of solutions $\mathbf{B}$ to this system is at most $|\mathcal{R}|^{n^2 - a_2 - \gamma b/4}$. Since there are at most $|\mathcal{R}|^{n^2 - a_1}$ matrices $A$ that are $\gamma$-nice and consistent with the $a_1$ selected inputs in $A$, it follows that the number of pairs consistent with values of the selected inputs and outputs is at most $|\mathcal{R}|^{2n^2 - a - \gamma b/4}$, the desired conclusion. ■

This result provides a lower bound on the space and time for matrix multiplication. The upper bound cited below is obtained by another hybrid algorithm that mixes a branching program for the standard algorithm with one for table lookup.

THEOREM 10.13.4 There is an integer $n_0 > 0$ such that for $n \ge n_0$ the space $S$ and time $T$ needed to compute the matrix multiplication function $f^{(n)}_{A \times B} : \mathcal{R}^{2n^2} \mapsto \mathcal{R}^{n^2}$ over the ring $\mathcal{R}$ using a general branching program satisfy the inequality:

ST2  >  log2  \n\ 

for some $0 < \gamma < 1/2$ when $T \ge n^2$. This lower bound can be achieved up to a multiplicative factor of $O(\log n)$ for space in the range $\Omega(\log n + \log|\mathcal{A}|) \le S \le O(n \log|\mathcal{A}|)$.

Proof The lower bound follows from Theorem 10.11.1 and Lemma 10.13.3 by letting $a = \lfloor \gamma^2 n^4/4T \rfloor$, since this value of $a$ satisfies the two conditions $a \le \tau(ma/2T) = \gamma n\sqrt{ma/4T}$ and $a \le 2n^2$ when $T \ge n^2$.

At the extreme of large space, namely $S = O(n^2)$, the upper bound follows from a branching program for table lookup that has one level for each of the $2n^2$ variables in the matrices $A$ and $B$ and the fact that there are $|\mathcal{R}|^{2n^2}$ pairs of such matrices over the ring $\mathcal{R}$. Consequently, the branching program has at most $O(|\mathcal{R}|^{2n^2})$ vertices and space $O(n^2 \log|\mathcal{R}|)$. It uses $O(n^2)$ steps.

At the extreme of small space, namely $S = \Omega(\log n + \log|\mathcal{A}|)$, we use a branching program for the standard matrix multiplication algorithm that forms $n^2$ inner products of rows and columns of the two matrices. As discussed in the proof of Theorem 10.13.3, a branching program can be constructed to form the inner product of two $n$-tuples that has $\Theta(n|\mathcal{A}|^2)$ vertices; that is, space $\Omega(\log n + \log|\mathcal{A}|)$ and time $O(n)$. Concatenating $n^2$ of these programs, one for each of the $n^2$ entries in the product matrix, we have a branching program with space $\Omega(\log n + \log|\mathcal{A}|)$ and time $O(n^3)$.

To  fill  in  the  gap  between  these  extremes,  the  method  applied  in  Theorem  10.13.3  can 
be  used,  as  the  reader  can  demonstrate.  (See  Problem  10.40.)  ■ 

10.13.5  Matrix  Inversion 

As  an  intermediate  step  to  deriving  a  space-time  product  lower  bound  on  matrix  inversion,  we 
derive  a  lower  bound  for  the  product  of  three  n  x  n  matrices.  This  is  done  by  first  deriving 
an  alternate  representation  for  this  product  in  terms  of  the  Kronecker  product  of  two  matrices. 
Kronecker  products  are  defined  in  Section  10.12. 

LEMMA 10.13.4 Let $A$, $B$, $C$, and $D$ be $n \times n$ matrices over a commutative ring. The following two equations define the same set of mappings from entries of $A$, $B$, and $C$ to entries in $D$:

$$D = ABC$$

$$\mathbf{E} = (A \otimes C^T)\mathbf{B}$$

where $\mathbf{B}$ and $\mathbf{E}$ are $n^2 \times 1$ column vectors obtained by concatenating the transposes of the rows of the matrices $B$ and $D$, respectively.

Proof Let $\mathbf{E} = (A \otimes C^T)\mathbf{B}$. The goal is to show that the results in the $n^2 \times 1$ column vector $\mathbf{E}$ are the same as those in the $n \times n$ matrix $D$ but in a different order. In particular, we show that the $(ni + j)$th entry in the former, namely $e_{ni+j,1}$, is equal to the $(i, j)$ entry in $D$, namely $d_{i,j}$.

Given a matrix $F$, let $f_{i,j}$ denote its entry in the $i$th row and $j$th column. Let $f_{i,-}$ and $f_{-,j}$ denote the $i$th row and $j$th column of $F$, respectively. Let rows and columns of matrices be numbered consecutively from zero.

The matrix $A \otimes C^T$ consists of blocks of $n$ consecutive rows with the $i$th block containing $[a_{i,0}C^T, a_{i,1}C^T, \ldots, a_{i,n-1}C^T]$. Thus, the $(ni + j)$th entry of $\mathbf{E}$, namely $e_{ni+j,1}$, is the $j$th entry in the product $[a_{i,0}C^T, a_{i,1}C^T, \ldots, a_{i,n-1}C^T]\mathbf{B}$, as shown below, where $(c_{-,j})^T (b_{k,-})^T$ is the inner product of the row vector $(c_{-,j})^T$ with the column vector $(b_{k,-})^T$.

$$
\begin{aligned}
e_{ni+j,1} &= \sum_{k=0}^{n-1} a_{i,k}\,(c_{-,j})^T (b_{k,-})^T \\
&= \sum_{k=0}^{n-1} \sum_{l=0}^{n-1} a_{i,k}\, c_{l,j}\, b_{k,l} \\
&= \sum_{k=0}^{n-1} \sum_{l=0}^{n-1} a_{i,k}\, b_{k,l}\, c_{l,j} \\
&= d_{i,j}
\end{aligned}
$$

This is the desired conclusion. ■
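
The identity in Lemma 10.13.4 is easy to check numerically. The following Python sketch is an illustration only: the helper names (`matmul`, `kron`, `check_triple_product`) are hypothetical, and small random integer matrices stand in for matrices over a commutative ring. It compares entry $ni + j$ of $(A \otimes C^T)\mathbf{B}$ with entry $(i, j)$ of $ABC$, using zero-based indices.

```python
import random

def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def transpose(X):
    return [list(col) for col in zip(*X)]

def kron(X, Y):
    """Kronecker product: block (i, k) of the result is X[i][k] * Y."""
    return [[X[i][k] * Y[j][l]
             for k in range(len(X[0])) for l in range(len(Y[0]))]
            for i in range(len(X)) for j in range(len(Y))]

def check_triple_product(n, seed=0):
    """Return True if (A kron C^T) b reproduces ABC entry by entry."""
    rng = random.Random(seed)
    A, B, C = ([[rng.randint(-5, 5) for _ in range(n)] for _ in range(n)]
               for _ in range(3))
    D = matmul(matmul(A, B), C)
    # Column vector formed by stacking the transposed rows of B.
    b_vec = [[B[k][l]] for k in range(n) for l in range(n)]
    E = matmul(kron(A, transpose(C)), b_vec)
    return all(E[n * i + j][0] == D[i][j]
               for i in range(n) for j in range(n))
```

Integer entries are used so that the comparison is exact; the reordering of factors in the derivation relies on commutativity, which integers satisfy.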


With  this  as  background,  we  state  the  space-time  results  to  compute  the  product  of  three 
matrices. 


THEOREM 10.13.5 There is an integer $n_0 > 0$ such that for $n \ge n_0$ the time $T$ and space $S$ used by any general branching program to compute the product of three $n \times n$ matrices over a commutative ring $\mathcal{R}$ must satisfy the following inequality:

$$ST = \Omega(n^4 \log|\mathcal{R}|)$$

Proof Given a general branching program to compute $ABC$, no more space or time is used when the matrices $A$ and $C$ are given specific values. Let them each be $\gamma$-nice for some $0 < \gamma < 1/2$. The existence of such matrices is established in Lemma 10.12.1. From Lemma 10.12.2 we know that the matrix $A \otimes C^T$ is $\gamma^2$-ok. The result follows from Theorem 10.13.3 since $A \otimes C^T$ is $n^2 \times n^2$. ■


We  are  now  prepared  to  state  space-time  bounds  for  matrix  inversion. 


THEOREM 10.13.6 There is an integer $n_0 > 0$ such that for $n \ge n_0$ the time $T$ and space $S$ used by any general branching program to compute the inverse of a non-singular $n \times n$ matrix over a commutative ring $\mathcal{R}$ must satisfy the following inequality:

$$ST = \Omega(n^4 \log|\mathcal{R}|)$$


This lower bound can be achieved to within a multiplicative factor over the range $\Omega(n^2) \le T \le O(n^3)$.

Proof Let $n$ be a multiple of 4. The lower bound follows by reducing matrix inversion to the computation of the product of three arbitrary $n/4 \times n/4$ matrices, as shown below:

$$
\begin{bmatrix}
I & -A & 0 & 0 \\
0 & I & -B & 0 \\
0 & 0 & I & -C \\
0 & 0 & 0 & I
\end{bmatrix}^{-1}
=
\begin{bmatrix}
I & A & AB & ABC \\
0 & I & B & BC \\
0 & 0 & I & C \\
0 & 0 & 0 & I
\end{bmatrix}
$$

The upper bound for $T = \Theta(n^2)$ is obtained by table lookup using an algorithm of the kind described in the proof of Theorem 10.13.3. For $T = O(n^3)$, the matrix inversion algorithm based on the $LDL^T$ decomposition of a symmetric positive definite matrix of Section 6.5.4 can be used. For intermediate values of time, a hybridized algorithm based on the inversion of block matrices provides the stated upper bound. ■
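
The reduction rests on a block identity: the $4 \times 4$ block upper-bidiagonal matrix with identity blocks on the diagonal and $-A$, $-B$, $-C$ on the superdiagonal has an inverse whose top-right block is $ABC$. The sketch below is a minimal check of that identity under stated assumptions: the helper names are hypothetical, and random integer blocks are used so that the arithmetic is exact.

```python
import random

def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def identity(m):
    return [[1 if i == j else 0 for j in range(m)] for i in range(m)]

def zeros(m):
    return [[0] * m for _ in range(m)]

def neg(X):
    return [[-v for v in row] for row in X]

def block(grid):
    """Assemble a matrix from a grid (list of rows) of equally sized blocks."""
    return [sum((blk[r] for blk in brow), [])
            for brow in grid for r in range(len(brow[0]))]

def check_reduction(m, seed=0):
    """Return True if the claimed inverse times the block matrix is I."""
    rng = random.Random(seed)
    A, B, C = ([[rng.randint(-4, 4) for _ in range(m)] for _ in range(m)]
               for _ in range(3))
    I, Z = identity(m), zeros(m)
    AB, BC = matmul(A, B), matmul(B, C)
    ABC = matmul(AB, C)
    M = block([[I, neg(A), Z, Z],
               [Z, I, neg(B), Z],
               [Z, Z, I, neg(C)],
               [Z, Z, Z, I]])
    M_inv = block([[I, A, AB, ABC],
                   [Z, I, B, BC],
                   [Z, Z, I, C],
                   [Z, Z, Z, I]])
    return matmul(M, M_inv) == identity(4 * m)
```

Any program that inverts the $n \times n$ matrix on the left therefore also delivers the product of three arbitrary $n/4 \times n/4$ matrices, which is why the lower bound for the triple product carries over.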

10.13.6  Discrete  Fourier  Transform 

The  discrete  Fourier  transform  (DFT)  and  the  fast  Fourier  transform  algorithm  are  described 
in  Sections  6.7.2  and  6.7.3.  In  this  section  we  derive  upper  and  lower  bounds  on  space- 
time tradeoffs for this problem. The lower bound follows from the result for matrix-vector multiplication and the fact that the coefficient matrix for the DFT is (1/4)-ok.

LEMMA 10.13.5 Consider the $n$-point DFT over a commutative ring that has a principal $n$th root of unity $\omega$. It is defined as a matrix-vector product with $[\omega^{ij}]$ as its $n \times n$ coefficient matrix. This matrix is (1/4)-ok.

Proof We use the fact, shown in Theorem 10.5.5, that the submatrix of $W = [\omega^{ij}]$ consisting of any $k$ rows and any $k$ consecutive columns is non-singular. We show that any $p \times q$ submatrix $B$ of $W$, with $p \le \lceil n/4 \rceil$ and $q \ge n - \lceil n/4 \rceil$, has rank at least $p/4$.

Let $I$ denote the row indices of the submatrix $B$ and let $J$ denote its column indices. Let $C$ be the submatrix of $W$ with row indices in $I$. Divide the columns of $C$ into $\lceil n/p \rceil$ groups each containing $p$ columns except possibly the last, which has at most $p$ columns. We claim that some group has at least $p/2$ columns in common with $B$. Suppose not. Then every one of the $\lceil n/p \rceil$ groups has at most $(p - 1)/2$ columns in common with $B$. Thus $B$ has at most $\chi(p) = \lceil n/p \rceil (p - 1)/2$ columns. We show that $\chi(p) < n - (n + 3)/4 \le n - \lceil n/4 \rceil$. But this is a contradiction because $B$ has at least $n - \lceil n/4 \rceil$ columns. Since $\lceil n/p \rceil \le (n + p - 1)/p$, it suffices to show that $(n + p - 1)(p - 1)/2p < n - (n + 3)/4$, which after multiplying both sides by $2p$ becomes:

$$(n + p - 1)(p - 1) < \frac{3p(n - 1)}{2} \quad \text{or} \quad -n + 1 < p\left(\frac{n + 1}{2} - p\right)$$

Since $-n + 1 \le 0$, it suffices to show that the right-hand side of the last inequality is positive. But $((n + 1)/2) - p$ is positive since $p \le \lceil n/4 \rceil \le (n + 3)/4 < (n + 1)/2$ for $n > 1$. ■
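
The counting inequality at the heart of this proof, $\lceil n/p \rceil (p - 1)/2 < n - \lceil n/4 \rceil$ for all $1 \le p \le \lceil n/4 \rceil$ and $n > 1$, can be spot-checked by brute force. The sketch below is only a sanity check under that reading of the statement; the function names are hypothetical, and both sides are doubled so that all arithmetic stays in integers.

```python
def ceil_div(a, b):
    """Ceiling of a / b for positive integers."""
    return -(-a // b)

def twice_chi(n, p):
    """2 * chi(p), where chi(p) = ceil(n/p) * (p - 1) / 2."""
    return ceil_div(n, p) * (p - 1)

def bound_holds(n):
    """Check chi(p) < n - ceil(n/4) for every p with 1 <= p <= ceil(n/4)."""
    limit = n - ceil_div(n, 4)
    return all(twice_chi(n, p) < 2 * limit
               for p in range(1, ceil_div(n, 4) + 1))
```

A finite check of this kind does not replace the algebraic argument; it only guards against transcription slips in the inequality.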

THEOREM 10.13.7 There is an integer $n_0 > 0$ such that for $n \ge n_0$ the $n$-point DFT over a commutative ring $\mathcal{R}$ requires space $S$ and time $T$ with a branching program satisfying the following lower bound:

$$ST = \Omega(n^2 \log|\mathcal{R}|)$$

This  lower  bound  can  be  achieved  to  within  a  constant  multiplicative  factor. 

Proof  The  upper  bound  follows  by  applying  Lemma  10.9.3  and  Theorem  10.5.5.  ■ 

10.13.7  Unique  Elements 

We  now  derive  a  lower  bound  on  the  space-time  product  for  the  sorting  problem  by  reducing 
sorting  to  the  unique-elements  problem.  The  unique  elements  problem  takes  a  list  of  values 
and  returns  in  any  order  a  list  of  the  non-repeated  elements  among  them. 

DEFINITION 10.13.1 Let $\mathcal{R}$ be a set with at least $n$ distinct elements. The function $f^{(n)}_{\text{unique}} : \mathcal{R}^n \mapsto 2^{\mathcal{R}}$ defines the unique elements problem, where $2^{\mathcal{R}}$ is the power set of $\mathcal{R}$ and $f^{(n)}_{\text{unique}}(x)$ is the set of non-repeated elements in the input string $x$.

We emphasize that no order is imposed on the outputs of $f^{(n)}_{\text{unique}}$. Thus, if a set of values appears in the output, their position in the output does not matter.

From Lemma 10.11.2 it follows that a lower bound to $ST$ can be derived by restricting the domain and discarding outputs. We restrict the domain by restricting each input variable to values in a subset $\mathcal{S} \subseteq \mathcal{R}$ containing $n$ elements. We also restrict input tuples to the set $\mathcal{D}$ containing at least $n/(2e)$ unique values ($e$ is the base of the natural logarithm). In the following lemma we show that $|\mathcal{D}| \ge |\mathcal{S}|^n/(2e - 1) = \phi n^n$, where $\phi = 1/(2e - 1)$. On inputs in $\mathcal{D}$ the function $f^{(n)}_{\text{unique}}$ has at least $n/(2e)$ unique outputs. We define the subfunction $f^{(n)}_{\text{restricted}} : \mathcal{D} \mapsto \mathcal{S}^m$, $m = n/(2e)$, of $f^{(n)}_{\text{unique}}$ to be the subfunction obtained by restricting its inputs to $\mathcal{D} \subseteq \mathcal{S}^n$ and deleting all but the first $n/(2e)$ outputs, which are all unique.

LEMMA 10.13.6 Let $\mathcal{S}$ be a set of $n$ elements. The fraction $\phi$ of the input $n$-tuples over $\mathcal{S}^n$ containing $n/(2e)$ or more unique elements exceeds $1/(2e - 1)$.

Proof We use simple probabilistic arguments. Assign each $n$-tuple over $\mathcal{S}^n$ probability $1/n^n$. Let $u(x)$ be the number of unique elements in $x$. Let $X_i(x)$ have value 1 if the $i$th element of $\mathcal{S}$ occurs uniquely in $x$ and value 0 otherwise. Then

$$u(x) = \sum_{i=1}^{n} X_i(x)$$

Let $E[u]$ denote the average value of $u(x)$ (the sum of $u(x)$ over $x$ weighted by its probability). Because the order of summation can be changed without affecting the sum, we have

$$E[u(x)] = \sum_{i=1}^{n} E[X_i(x)]$$

$E[X_i(x)]$ is also the probability that $X_i = 1$. If $X_i = 1$, then each of the other components of $x$ can assume only one of $n - 1$ values. Since the $i$th value can be in any one of $n$ positions among input variables and since for each position that it occupies there are $(n - 1)^{n-1}$ ways to fill the remaining $n - 1$ positions so that the $i$th value is unique, we have that $E[X_i] = f(n)$ where $f(n) = n(n - 1)^{n-1}/n^n = (1 - 1/n)^n/(1 - 1/n)$. But $f(n)$ is a decreasing function of $n$, as is shown by calculating its derivative and using the inequality $(1 - x) \le e^{-x}$ (see Problem 10.5). The limit of $f(n)$ for large $n$ is $e^{-1}$ because in the limit of small $x$ the function $e^{-x}$ has value $1 - x$. It follows that $E[u(x)] \ge n/e$.

Let $\pi = \Pr[u(x) \ge n/(2e)]$ be the fraction (or probability) of the input $n$-tuples for which $u(x) \ge n/(2e)$. Because $u(x) \le n$, it follows that $\pi n + (1 - \pi)n/(2e) \ge E[u(x)] \ge n/e$, from which we conclude that $\pi \ge 1/(2e - 1)$. (This is known as Markov's inequality.) ■
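
For small $n$ the quantities in this proof can be computed exhaustively. The following Python sketch is an illustration, not part of the proof: the function names are hypothetical, and the enumeration over all $n^n$ tuples is feasible only for very small $n$. It evaluates $f(n)$, the average number of unique elements, and the fraction of tuples with at least $n/(2e)$ unique elements.

```python
from collections import Counter
from itertools import product
from math import e

def f(n):
    """Probability that a fixed value occurs exactly once in a random n-tuple."""
    return n * (n - 1) ** (n - 1) / n ** n

def unique_count(x):
    """Number of values that occur exactly once in the tuple x."""
    return sum(1 for c in Counter(x).values() if c == 1)

def average_uniques(n):
    """E[u(x)] computed by enumerating all n-tuples over an n-element set."""
    total = sum(unique_count(x) for x in product(range(n), repeat=n))
    return total / n ** n

def fraction_with_many_uniques(n):
    """Fraction of n-tuples having at least n/(2e) unique elements."""
    threshold = n / (2 * e)
    hits = sum(1 for x in product(range(n), repeat=n)
               if unique_count(x) >= threshold)
    return hits / n ** n
```

For the sizes that can be enumerated, the fraction sits well above $1/(2e - 1) \approx 0.2254$, which is consistent with the lemma being a worst-case bound.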


LEMMA 10.13.7 Let $|\mathcal{S}| = n$. Then $f^{(n)}_{\text{restricted}} : \mathcal{D} \mapsto \mathcal{S}^m$, $m = n/(2e)$, is $(\phi, \lambda, \mu, \nu, \tau)$-distinguishable for $\phi = 1/(2e - 1)$, $\lambda = \mu = 1$, $\nu = (1 - 1/(2e))/\log_2 n$, and $\tau(b) = n$.

Proof If $f^{(n)}_{\text{restricted}}$ is $(\phi, \lambda, \mu, \nu, \tau)$-distinguishable for $\phi = 1/(2e - 1)$, $\lambda = \mu = 1$, $\nu = (1 - 1/(2e))/\log_2 n$, and $\tau(b) = n$, then for at least $\phi n^n$ input tuples and any $a \le \lambda n$ input and $b \le \mu m$ output variables and specified values for them, $f^{(n)}_{\text{restricted}}$ has at most

$$n^{n - a - \nu b} = n^{n-a} e^{-(1 - 1/(2e))b}$$

input $n$-tuples that are consistent with these assignments.

The order of output values to $f^{(n)}_{\text{restricted}}$ is irrelevant.

Let $B$ be the values of the $b$ selected and specified unique outputs, $b \le m$, and let $A$ be the values of the $a$ selected and specified input values. The $k$ values in $B - A$ appear in input positions that are not specified. $r = n - k - a$ inputs are in neither $A$ nor $B$. We overestimate the number of patterns of inputs consistent with the $a$ inputs and $b$ outputs that are specified if we allow these $r$ inputs to assume any value not in $B$, since all values in $B$ are unique. Thus, there are at most $(n - b)^r$ ways to assign values to these $r$ inputs. The $k$ values in $B - A$ are fixed, but their positions among the $r + k$ non-selected inputs are not fixed. Since there are $(r + k)!/r!$ ways for these ordered $k$ values to appear among any specific ordering of the remaining $r$ non-selected inputs (see Problem 10.6), the number $Q$ of input patterns consistent with the selected and specified $a$ inputs and $b$ outputs satisfies the following inequality:

$$Q \le \frac{(r + k)!}{r!} (n - b)^r$$

Here $r + k = n - a \le n$ and $k \le b$. Below we bound $(r + k)!/r!$ by $(r + k)^k$ and use the inequality $(1 - x) \le e^{-x}$:

$$
\begin{aligned}
Q &\le (r + k)^k (n - b)^r \le n^{r+k} \left(1 - \frac{a}{n}\right)^k \left(1 - \frac{b}{n}\right)^r \\
&\le n^{n-a} e^{-(ka/n + rb/n)} = n^{n-a} e^{-(ka/n + (n - a - k)b/n)}
\end{aligned}
$$

The exponent $e(a, b, k) = ka/n + (n - a - k)b/n$ is a decreasing function of $a$ whose smallest value is $(1 - k/n)b$. In turn, this function is a decreasing function of $k$ whose smallest value is $(1 - b/n)b \ge (1 - 1/(2e))b$. As a consequence, we have

$$Q \le n^{n-a} e^{-(1 - 1/(2e))b}$$

It follows that $f^{(n)}_{\text{restricted}}$ is $(\phi, \lambda, \mu, \nu, \tau)$-distinguishable for $\phi = 1/(2e - 1)$, $\lambda = \mu = 1$, $\nu = (1 - 1/(2e))/\log_2 n$, and $\tau(b) = n$. ■

    b := 0;
    for j := 1 to ⌈n/S⌉
      {b = (j − 1)S on the jth iteration.}
      begin
        for i := 1 to S
          C[i] := 0;
        for i := 1 to n
          if b < x_i ≤ b + S then
            begin
              k := x_i − b;
              if C[k] < 2 then C[k] := C[k] + 1;
            end;
        for i := 1 to S
          if C[i] = 1 then print b + i;
        b := b + S;
      end

Figure 10.22 A RAM program for the unique-elements problem over the set $\{1, 2, \ldots, n\}$ when $n \ge S \ge O(\log n)$. The input to the program is the $n$-tuple $x$ in which $x_i$ is the $i$th entry. The program uses space $O(S)$.
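
The RAM program of Fig. 10.22 translates almost line for line into a runnable form. The Python sketch below is one possible rendering, not the book's code: the function name `unique_elements` is hypothetical, the input is assumed to be a list of integers in $\{1, \ldots, n\}$ with $n$ the list length, and the output is returned as a list instead of being printed.

```python
def unique_elements(x, S):
    """Return the values that occur exactly once in x, using S counters per pass.

    Assumes every entry of x lies in {1, ..., n} with n = len(x) and S >= 1.
    """
    n = len(x)
    out = []
    b = 0
    while b < n:                   # ceil(n / S) passes over the input
        C = [0] * (S + 1)          # C[1..S]; index 0 is unused
        for xi in x:
            if b < xi <= b + S:    # value belongs to the current block
                k = xi - b
                if C[k] < 2:       # saturate at 2, meaning "more than once"
                    C[k] += 1
        for i in range(1, S + 1):
            if C[i] == 1:
                out.append(b + i)
        b += S
    return out
```

Each pass reads all $n$ inputs and there are $\lceil n/S \rceil$ passes, so the running time is $O(n^2/S)$ while only $S$ two-bit counters are live at any moment.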


Invoking Theorem 10.11.1, we have a quadratic space-time product lower bound. The RAM program for the unique elements problem given in Fig. 10.22 can be converted to a branching program to obtain an upper bound on the space-time product needed for this problem, as shown in Theorem 10.13.8.

THEOREM 10.13.8 Let $|\mathcal{R}| \ge n$. There is an integer $n_0 > 0$ such that for $n \ge n_0$ and $S = \Omega(\log n)$ the time $T$ and space $S$ used by any general branching program for the unique elements function $f^{(n)}_{\text{unique}} : \mathcal{R}^n \mapsto 2^{\mathcal{R}}$ must satisfy

$$ST = \Omega(n^2)$$

This lower bound can be met to within a constant multiplicative factor for inputs drawn from the set $\{1, 2, 3, \ldots, n\}$.

Proof The lower bound follows directly from Theorem 10.11.1. The upper bound follows from an analysis of the branching program that results from conversion of the RAM program in Fig. 10.22. The RAM program makes $\lceil n/S \rceil$ passes over the input data. On the $j$th pass the program examines input values in the range $\{(j - 1)S + 1, \ldots, jS\}$ and determines for each value whether there are zero, one, or more than one instances of it in the input.

The program uses an $S$-element one-dimensional array $C[1..S]$ that it initializes to zero at the beginning of each pass. If on the $j$th pass the $i$th input variable, $x_i$, is in the interval $\{(j - 1)S + 1, \ldots, jS\}$, the array element associated with it, namely $C[x_i - (j - 1)S]$, is incremented unless it already has value 2. At the end of the $j$th pass, if the array element $C[i]$ has value 1, the program prints out the value $(j - 1)S + i$, namely, the value of an input that appears only once in the input.

The reader is asked to show that the program of Fig. 10.22 can be converted to a branching program of space $O(S)$ and time $O(T)$. (See Problem 10.41.) ■

The program of Fig. 10.22 relies on the fact that input variables are drawn from the set $\{1, 2, 3, \ldots, n\}$. If the set from which they are drawn is much larger, say $\{1, 2, 3, \ldots, n^c\}$, $c > 1$, the outer loop is executed $O(n^c/S)$ times and its total running time is $O(n^c)$. Thus, the program is not optimal in this case.

10.13.8  Sorting 

The sorting problem is described in Section 6.8. The general sorting problem is defined by a function $f^{(n)}_{\text{sort}} : \mathcal{R}^n \mapsto \mathcal{R}^n$ that rearranges the values of input variables so they are in descending order. Given a branching program for sorting, we show below that a branching program for the unique-elements problem can be obtained with a small additional amount of space. As a consequence, the space-time product lower bound for unique elements applies to the sorting problem. We also give a nearly matching upper bound.

THEOREM 10.13.9 Let $|\mathcal{R}| \ge n$. There is an integer $n_0 > 0$ such that for $n \ge n_0$ and $S = \Omega(\log n)$ the time $T$ and space $S$ used by any general branching program for the sorting function $f^{(n)}_{\text{sort}} : \mathcal{R}^n \mapsto \mathcal{R}^n$ that reports its outputs in descending order must satisfy

$$ST = \Omega(n^2)$$

This lower bound can be met to within a constant multiplicative factor for inputs drawn from the set $\{1, 2, 3, \ldots, n\}$.

Proof Given a branching program for $f^{(n)}_{\text{sort}}$ that uses space $S$, we use it to construct a branching program for $f^{(n)}_{\text{unique}}$ that uses space $S + O(\log n) = O(S)$. Since $f^{(n)}_{\text{unique}}$ requires space that is $\Omega(n^2/T)$, the same lower bound applies to sorting.

The branching program for $f^{(n)}_{\text{sort}}$ generates its sorted outputs in descending order. By analyzing the outputs the unique elements can be found. Store the last output $\ell$ along with a bit $b$ that is 1 if $\ell$ is so far the only occurrence of this value and 0 otherwise. If the next output value is the same as $\ell$, set $b$ to 0. If it is different from $\ell$, produce $\ell$ as an output when $b = 1$ and produce no output when $b = 0$; in either case replace $\ell$ with the new output and set $b$ to 1. When the sorted outputs are exhausted, produce the stored value $\ell$ if $b = 1$.
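
The filtering step described here needs only the most recent sorted value and one bit of state. The following Python generator is a minimal sketch of that idea; the name `uniques_from_sorted` and the explicit `started` flag are assumptions of the sketch, and any iterable that yields equal values consecutively (for example, a descending sorted list) can play the role of the sorter's output.

```python
def uniques_from_sorted(stream):
    """Yield the values that occur exactly once in a sorted stream.

    State kept between items: the last value seen and one bit recording
    whether that value has so far appeared only once.
    """
    last = None
    alone = False
    started = False
    for v in stream:
        if started and v == last:
            alone = False              # a repeat: last is not unique
        else:
            if started and alone:
                yield last             # last occurred exactly once
            last, alone, started = v, True, True
    if started and alone:
        yield last                     # flush the final stored value
```

For example, `list(uniques_from_sorted([9, 7, 7, 4, 2, 2, 1]))` evaluates to `[9, 4, 1]`. The extra state is one value and one bit, which matches the $O(\log n)$ additive space overhead claimed for the reduction when values come from a set of size polynomial in $n$.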

Given a branching program $\Pi$ for sorting, we describe a branching program for unique elements that uses modified copies of $\Pi$. If more than one output appears on some edge in $\Pi$, modify it (yielding $\Pi^*$) by replacing edges producing more than one output by a sequence of edges each producing one output separated by vertices testing an arbitrary input. This increases the number of vertices in $\Pi$ by a factor of at most $n$ and adds at most $\log_2 n$ to its space. Now make $2|\mathcal{R}|$ additional copies of $\Pi^*$, two for each value in $\mathcal{R}$: a “one” copy, used when the value has so far been encountered once in the sorted output, and a “zero” copy, used when it has been encountered more than once.

Consider an edge in $\Pi^*$ or one of its copies that produces an output (call it $v$). There are several cases to examine: the current copy of $\Pi^*$ is (a) the original copy, (b) a “one” copy, or (c) a “zero” copy. In case (a), redirect the edge to the same vertex in the “one” copy of $\Pi^*$ associated with $v$. In case (b), if $v$ is different from the value $c$ associated with the current copy of $\Pi^*$, output $c$ and redirect the edge to the same vertex in the “one” copy of $\Pi^*$ associated with $v$; if $v$ is the same as $c$, produce no output and redirect the edge to the same vertex in the “zero” copy associated with $c$. In case (c), if $v$ is the same as the value associated with the current copy of $\Pi^*$, produce no output; otherwise also produce no output but redirect the edge to the same vertex in the “one” copy of $\Pi^*$ associated with $v$. The new branching program has at most $2n + 1$ copies of $\Pi^*$, thereby increasing its space by an additive term of size $O(\log n)$. The lower bound on $ST$ for the sorting problem follows.

The upper bound on $ST$ for the sorting problem is obtained by constructing a family of branching programs, one for each value of $S$. We begin by constructing a “full” branching program for the case $S = O(n)$. Let the variables in the input string be $x_1, x_2, \ldots, x_n$ and let them be tested in sequence. Thus, the root is labeled $x_1$ and has $n$ successors, each of which tests $x_2$. There is one successor for each vertex labeled with $x_2$ for each way two numbers can be chosen with replacement from the set $\{1, 2, \ldots, n\}$. As shown in Problem 10.7, there are $N(n, k)$ ways in which $k$ numbers can be drawn from a set of $n$ elements with replacement where the order among the numbers is unimportant and

$$N(n, k) = \binom{n + k - 1}{k}$$

Thus, $N(n, 1) = n$ and $N(n, 2) = (n + 1)n/2$. The successors to vertices labeled $x_2$ are labeled $x_3$. They have $N(n, 3)$ successors, and so on. At the $k$th level there are $N(n, k)$ successors. Since $N(n, k) \le 2^{n+k-1}$, it follows that for $k \le n$ the above branching program has $O(2^{2n})$ vertices or space $S = \Theta(n)$. It also has time $T = n$ and space-time product $O(n^2)$.
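
The level sizes of the full branching program are multiset counts, so they can be tabulated directly. The short Python sketch below is illustrative only (the names `N` and `level_sizes` are chosen here for convenience); it uses the standard library's `math.comb` and cross-checks the closed form against an explicit enumeration of multisets.

```python
from itertools import combinations_with_replacement
from math import comb

def N(n, k):
    """Number of ways to choose k items from n with repetition, order ignored."""
    return comb(n + k - 1, k)

def level_sizes(n):
    """Sizes N(n, 1), ..., N(n, n) of the levels of the full branching program."""
    return [N(n, k) for k in range(1, n + 1)]

def count_multisets(n, k):
    """Brute-force count of size-k multisets over {0, ..., n-1}."""
    return sum(1 for _ in combinations_with_replacement(range(n), k))
```

Because every level has at most $2^{n+k-1} \le 2^{2n-1}$ vertices and there are $n$ levels, the logarithm of the total vertex count is linear in $n$, which is the space figure quoted for the full program.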

To construct a branching program for space $S = O(n)$, we use $O(n/S)$ pruned copies of the full branching program described above. The idea behind the pruning is the following: we scan the input list looking for variables with values in the set $\{1, 2, \ldots, S\}$. If there are $O(S)$ of them, we record the number of values of each type and produce them in sorted order. However, if there are more than $O(S)$ elements in this range, as we examine additional inputs we reduce the size of the range so that only $O(S)$ space is used to carry the number of values of variables encountered. (This space is represented by $2^{O(S)}$ vertices in the branching program.) On each pass through the input either we reduce the size of the range by $O(S)$ or reduce the number of outputs that must be produced by the same amount. Thus, after $2n/S$ passes the input is sorted. Since each pass tests the value of each variable, the time is $O(n^2/S)$.
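
A simplified version of this multi-pass schema can be written in a few lines. The Python sketch below is a deliberately reduced variant and not the construction in the text: it assigns each pass a fixed block of $S$ consecutive values and lets each counter grow as large as $n$, whereas the schema described above shrinks the range adaptively so that the counts fit in $O(S)$ space. The function name `sort_in_passes` is hypothetical, and inputs are assumed to lie in $\{1, \ldots, n\}$ with $n$ the list length.

```python
def sort_in_passes(x, S):
    """Return x in descending order, scanning the input once per block of S values.

    Simplifying assumptions of this sketch: values lie in {1, ..., len(x)},
    S >= 1, and each pass handles a fixed block of S consecutive values.
    """
    n = len(x)
    out = []
    hi = n
    while hi > 0:
        lo = max(hi - S, 0)            # this pass handles values in (lo, hi]
        counts = [0] * (hi - lo)
        for xi in x:                   # one full scan of the input
            if lo < xi <= hi:
                counts[xi - lo - 1] += 1
        for v in range(hi, lo, -1):    # emit the block in descending order
            out.extend([v] * counts[v - lo - 1])
        hi = lo
    return out
```

With $\lceil n/S \rceil$ passes of $n$ tests each, the running time is $O(n^2/S)$, the same time bound as the schema; only the space accounting for the counters differs.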

It is not difficult to convert the above schema into a branching program. The goal is to have no more than about $2^S$ vertices on each level of the branching program. The branching program will consist of $O(n/S)$ copies of the full branching program, each having $n$ levels. Thus, the branching program will have $O(n^2 2^S/S)$ vertices or space $O(S)$.

We order vertices at each level in the branching program, placing those with smaller input values to the left. We remove vertices at the $j$th level that correspond to input values larger than $S$ as well as those to the right of the first $2^S$ vertices on the $j$th level. Each edge in the first full branching program that is directed into a removed vertex is redirected to the root of the next copy of the branching program. The second copy of the full branching program is pruned to remove the vertices appearing in the first copy as well as those reached on inputs outside the range $[S + 1, S + 2, \ldots, 2S]$. The edges directed to removed vertices are redirected to the root of the third copy of the full branching program. A similar process is applied to each copy of the full branching program. ■

Problems

MATHEMATICAL  PRELIMINARIES 

10.1  Show  that  the  the  pyramid  graph  on  m  inputs,  P(m),  has  m(m  +  1) /2  vertices.  Let 
n  =  m(m  +  l)/2.  Show  that  m  >  \j2n  —  1. 

10.2  Show  that  the  following  inequalities  hold  for  integers  TO  and  x: 

mix  <  \m/x]  <  (to  +  x  —  \)/x 
(to  —  x  +  l)/x  <  [m/x J  <  m/x 

10.3  Suppose  that  p  log2  p  <  q  for  positive  integers  p,  q  >  2.  Show  that  p  <2q/  log2  q. 

10.4  For  n  positive  integers  X\,  x2, . . .,  xn,  show  that  the  following  inequality  holds  between 
the  geometric  mean  on  the  left  and  the  arithmetic  mean  on  the  right: 

(x\X2  ■  ■  ■  xn)1/n  <  (x\  +  x2-\ - 1-  xn)/n 

10.5  Show  that  the  inequality  (1  —  x)  <  e~x  holds  for  x  <  1. 

10.6  Show  that  there  are  (r  +  k)\/r\  ways  for  k  ordered  values  to  appear  among  r  distinct 
ordered  items. 

10.7  Show  that  there  are  N(n,  k)  =  (n+*_I)  <  2rl+fe~~1  ways  to  choose  with  repetition  k 
numbers  from  a  set  A  of  size  n  where  the  order  among  the  numbers  is  unimportant. 
Choosing  with  repetition  means  that  a  number  can  be  chosen  more  than  once. 

Hint: Without loss of generality, let $A = \{1, 2, \ldots, n\}$. Since order is unimportant,
assume the chosen numbers are sorted. Let each chosen number be represented by a
blue marker. Imagine placing the blue markers on a horizontal line. For $1 \leq i \leq n - 1$,
place a red marker between the last blue marker associated with the number i and the
first blue marker associated with the number i + 1, if any. This representation uniquely
determines the number of elements of each type chosen. How many ways can the red
markers be placed?
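
The count in Problem 10.7 can be confirmed by brute force for small n and k. The sketch below enumerates sorted selections with repetition and compares the total with the binomial coefficient and the bound $2^{n+k-1}$. The helper names are hypothetical, and the enumeration is only practical for small parameters.

```python
from itertools import combinations_with_replacement
from math import comb


def count_multisets(n, k):
    """Count the ways to choose k numbers with repetition from {1, ..., n},
    ignoring order, by listing every sorted selection."""
    return sum(1 for _ in combinations_with_replacement(range(1, n + 1), k))


def multiset_formula(n, k):
    """The closed form N(n, k) = C(n + k - 1, k)."""
    return comb(n + k - 1, k)


def formula_matches(max_n=6, max_k=5):
    """Check the closed form and the bound 2^(n+k-1) for small n and k."""
    for n in range(1, max_n + 1):
        for k in range(0, max_k + 1):
            total = count_multisets(n, k)
            if total != multiset_formula(n, k) or total > 2 ** (n + k - 1):
                return False
    return True
```

The bound follows because the red and blue markers of the hint occupy n + k - 1 positions in total, and each position is either red or blue.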

10.8 Show that a complete balanced binary tree on $2^{k-1}$ leaves has $2^k - 1$ vertices including
leaves  and  that  each  path  from  a  leaf  to  the  root  has  k  —  1  edges  and  k  vertices. 

THE  PEBBLE  GAME 

10.9  Consider  the  circuit  shown  in  Fig.  2.15.  Treat  each  gate  and  each  input  vertex  as  a 
vertex.  Give  a  good  pebbling  strategy  for  this  graph. 

10.10 Give a pebbling strategy for the m-input counting circuit in Fig. 2.21(b) that uses
$O(\log^2 m)$ pebbles and $O(m)$ steps. Determine the minimum number of pebbles
with  which  the  circuit  can  be  pebbled.  Determine  the  number  of  steps  needed  with 
this  minimal  pebbling. 
SPACE LOWER BOUNDS WITH PEBBLING

10.11 Consider the FFT graph $F^{(k)}$ on $m = 2^k$ inputs. Show that the subgraph connecting
inputs  to  any  one  output  is  a  complete  binary  tree  on  m  leaves. 

10.12  Consider  a  directed  acyclic  graph  with  n  vertices,  some  of  which  have  out-degree  greater 
than  2.  (a)  Show  that  if  each  vertex  of  out-degree  k  >  2  is  replaced  by  a  binary  tree 
with  k  leaves  and  edges  directed  from  the  root  to  the  leaves,  the  number  of  vertices  in 
the  graph  is  at  most  doubled,  (b)  Show  that  replacing  vertices  with  in-degree  greater 
than  2  with  binary  trees  also  at  most  doubles  the  number  of  vertices  in  the  graph. 

EXTREME  TRADEOFFS  WITH  PEBBLING 

10.13 Let N(k) be the number of vertices in the graph $H_k$ discussed in Section 10.3. Show
that the following recurrence holds for N(k):

$N(k) = N(k - 1) + 4k + 3$

Show that $N(k) = 2k^2 + 5k - 6$ for $k \geq 2$ since N(2) = 12.
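
As a quick check of Problem 10.13, the recurrence can be unrolled from the base case N(2) = 12 and compared with the closed form. This is a sketch for verification only; the function names are hypothetical, and the recurrence is taken as N(k) = N(k - 1) + 4k + 3.

```python
def n_vertices_by_recurrence(k):
    """Unroll N(k) = N(k - 1) + 4k + 3 starting from N(2) = 12."""
    if k < 2:
        raise ValueError("the recurrence is stated for k >= 2")
    value = 12
    for j in range(3, k + 1):
        value += 4 * j + 3
    return value


def n_vertices_closed_form(k):
    """The claimed solution N(k) = 2k^2 + 5k - 6."""
    return 2 * k * k + 5 * k - 6
```

The two functions agree because the difference of consecutive closed-form values is $2(2k - 1) + 5 = 4k + 3$, which is exactly the increment in the recurrence.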

10.14 Construct a new family $\{G_k\}$ of graphs with fan-in 2 at each vertex from the graphs
$\{H_k\}$ by replacing the tree in Fig. 10.4 by a pyramid graph in k inputs and the bipartite
graph with the graph $E_k$ shown in Fig. 10.23. Show that each output of $E_k$ can be
pebbled with k pebbles but that after pebbling any one output there is at least one path
without pebbles between the input and every other output. Show also that with k + 1
pebbles $E_k$ can be pebbled without repebbling any vertex.

Let $T_k(S)$ be the number of steps to pebble $G_k$ with S pebbles. Using the above facts,
show the following:

a) $N(k) = |G_k| = O(k^4)$

b) $S_{\min}(G_k) = k$

c) $T_k(k + 1) = N(k)$

d) $T_k(k) = 2^{\Omega(N(k)^{1/4} \log N(k))}$
SPACE-TIME LOWER BOUNDS WITH PEBBLING

10.15 Let A be a $\gamma$-nice $n \times n$ matrix over a ring $\mathcal{R}$ for some $0 < \gamma < 1/2$. Show that the
matrix-vector multiplication function $f^{(n)}_{Ax} : \mathcal{R}^n \mapsto \mathcal{R}^n$ that maps the input n-tuple
x to the output n-tuple Ax is $(1, n^2 + n, n, \gamma n)$-independent.

10.16 Use Lemma 10.12.1 and the result of the previous problem to show that for almost
all $n \times n$ matrices A every straight-line program for the matrix-vector multiplication
function $f^{(n)}_{Ax} : \mathcal{R}^n \mapsto \mathcal{R}^n$ over the ring $\mathcal{R}$ requires space S and time T satisfying
the inequality

$(S + 1)T = \Omega(n^2)$

Furthermore, show that a straight-line program for matrix-vector multiplication can be
realized with space S = 3 and time $T = n(2n - 1)$, that is, with

$(S + 1)T = O(n^2)$
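
The time figure $T = n(2n - 1)$ in Problem 10.16 matches the number of arithmetic operations in the standard row-by-row algorithm, which the sketch below counts explicitly. It is an illustration under assumptions: the function name `matvec_with_count` is hypothetical, and the way the pebble game charges for placing pebbles on inputs may differ from this plain operation count.

```python
def matvec_with_count(A, x):
    """Compute y = A x row by row and count the multiplications and additions.

    Only two temporaries (acc and prod) are live at any moment; how this
    corresponds to the pebble count S = 3 depends on how input pebbles are
    accounted for, which this sketch does not model.
    """
    n = len(A)
    ops = 0
    y = []
    for i in range(n):
        acc = A[i][0] * x[0]          # first product of the row
        ops += 1
        for j in range(1, n):
            prod = A[i][j] * x[j]     # one multiplication
            ops += 1
            acc = acc + prod          # one addition
            ops += 1
        y.append(acc)
    return y, ops
```

Each row costs n multiplications and n - 1 additions, that is, 2n - 1 operations, so the n rows together cost n(2n - 1).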

10.17 Linear systems are described in Section 6.2.2. A linear system of n equations in n
unknowns x is defined by an $(n \times n)$-coefficient matrix A and an n-vector b, as
suggested below:

$Ax = b$ (10.11)

The goal is to solve this equation for x. If A is non-singular, such a solution exists for
each vector b. Let $f^{(n)}_{A^{-1}b} : \mathcal{R}^{n^2+n} \mapsto \mathcal{R}^n$ denote the linear system solver function
that maps the matrix A and the vector b onto the solution x when the matrix-vector
multiplication is over the ring $\mathcal{R}$ and A is non-singular.

Show that every pebbling strategy for every straight-line program to compute the linear
system solver function $f^{(n)}_{A^{-1}b} : \mathcal{R}^{n^2+n} \mapsto \mathcal{R}^n$ over the ring $\mathcal{R}$ for n even requires space
S and time T satisfying the following inequality:

$(S + 1)T \geq n^3/24$

Hint: Would it be possible to violate the lower bound on (S + 1)T for matrix inversion
given in Problem 10.25 if a DAG for the linear system solver function can be pebbled
with S pebbles in too few steps?

10.18 Let $f : A^n \mapsto A^m$ have $g : A^r \mapsto A^s$ as a subfunction. Show that if g is $(\alpha, r, s, p)$-independent
for $r \leq n$ and $s \leq m$, then so is f. Show that, as a consequence, the
space S and time T needed to pebble the graph of a straight-line program for f satisfy
the following inequality:

$\lceil \alpha(S + 1) \rceil T \geq sp/4$

10.19 Show that if a function is $(\alpha, n, m, p)$-independent, it is also $(\alpha, n, m, q)$-independent
for $q \leq p$.

Hint:  Consider  the  same  set  V  of  outputs  in  the  two  definitions. 
10.20 A finite-state machine M computes the function $f^{(n)}_M : Q \times \Sigma^n \mapsto \Psi^n$ that maps
the initial state in Q and an input string x of length n over the input alphabet $\Sigma$ onto
an output string y of the same length over the output alphabet $\Psi$. Such a machine
can compute a function $f : A^n \mapsto A^n$ by associating inputs and outputs of f with
inputs and outputs of $f^{(n)}_M$. A computation of an FSM M of a function f is input-output
oblivious if the times at which inputs of f are read and its outputs produced
are independent of the value of its input variables.

Show that Theorem 10.4.1 can be generalized from straight-line computations to computations
by input-output-oblivious FSMs.

Hint: Try to parallel the proof of Theorem 10.4.1 using the FSM M instead of the
pebble game. What correspondence can you make between the values under pebbles
before the interval X and the state of M? Let $\log_2 |Q|$, where Q is the set of states of
M, be the measure of space associated with it.

10.21 Give a design of an FSM that computes a function f from straight-line programs for it
using a number of steps and storage locations proportional to the time and space used
by a pebbling strategy for this straight-line program.

Hint: Design the FSM so that it receives the inputs provided to the pebbling strategy
as well as instructions to specify which operations are performed on the inputs and
temporary storage locations of the FSM.

TRANSITIVE FUNCTIONS

10.22 Many functions for which space-time lower bounds have been derived are transitive.
Such functions have the property that for subsets X and Y of their inputs and outputs,
respectively, $|X| = |Y| = n$, the (control) inputs not in X can be chosen so as to cause
the outputs in Y to be equal to an arbitrary permutation drawn from the set G(n)
of the inputs in X. For example, the cyclic shifting function studied in Section 2.5.2
has a set of control inputs that specify the amount by which value inputs are permuted
cyclically and assigned to the output variables.

DEFINITION 10.13.2 Let G(n) be a group of permutations of the integers $\mathbb{N}(n) = \{0,
1, 2, \ldots, n - 1\}$. That is, if $\pi$ is in G(n), then $\pi : \mathbb{N}(n) \mapsto \mathbb{N}(n)$. We denote by $\pi(i)$
the integer to which integer i is mapped by $\pi$. A function $f_{G(n)} : A^{n+s} \mapsto A^n$, where
$(y_{n-1}, \ldots, y_1, y_0) = f_{G(n)}(x_{n-1}, \ldots, x_1, x_0, c_{s-1}, \ldots, c_0)$, is said to have value inputs
$x_{n-1}, \ldots, x_1, x_0$, control inputs $c_{s-1}, \ldots, c_0$, and outputs $y_{n-1}, \ldots, y_1, y_0$.
Such a function is transitive of order n with respect to the group G(n) if

a) For each $0 \leq i \leq n - 1$ and $0 \leq j \leq n - 1$, there exists a permutation $\pi \in G(n)$
such that $\pi(i) = j$, and

b) For each $\pi \in G(n)$, there is an assignment to $c_{s-1}, \ldots, c_0$ such that $y_{\pi(i)} = x_i$ for
$0 \leq i \leq n - 1$.

Show that every transitive function of order n with respect to the permutation group
G(n), $f_{G(n)} : A^{n+s} \mapsto A^n$, is $(2, n + s, n, n/2)$-independent.

10.23 Show that the cyclic shifting function $f^{(n)}_{\text{cyclic}} : \mathcal{B}^{n + \lceil \log n \rceil} \mapsto \mathcal{B}^n$ defined in Section 2.5.2
is transitive of order n.
10.24 Consider the function $f^{(n)}_{PAQ} : \mathcal{R}^{3n^2} \mapsto \mathcal{R}^{n^2}$ whose value is the product PAQ of three
$n \times n$ matrices P, A, and Q. Let P and Q be permutation matrices whose entries serve
as control inputs. Show that $f^{(n)}_{PAQ}$ is transitive of order $n^2$.
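
To get a feel for why the product PAQ in Problem 10.24 can route any entry of A to any position, the following sketch builds permutation matrices from cyclic row and column shifts and checks where one entry lands. It illustrates only the first transitivity condition (every source position can reach every target position) and is not a full proof. All function names are hypothetical, and plain Python lists stand in for matrices over a ring.

```python
def matmul(X, Y):
    """Product of two square matrices given as lists of rows."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def row_perm_matrix(perm):
    """P with P[perm[i]][i] = 1, so row i of A becomes row perm[i] of PA."""
    n = len(perm)
    P = [[0] * n for _ in range(n)]
    for i, target in enumerate(perm):
        P[target][i] = 1
    return P


def col_perm_matrix(perm):
    """Q with Q[j][perm[j]] = 1, so column j of A becomes column perm[j] of AQ."""
    n = len(perm)
    Q = [[0] * n for _ in range(n)]
    for j, target in enumerate(perm):
        Q[j][target] = 1
    return Q


def move_entry(A, src, dst):
    """Return PAQ where P and Q are cyclic shifts chosen so that the entry of A
    at position src = (i, j) appears at position dst = (k, l)."""
    n = len(A)
    (i, j), (k, l) = src, dst
    row_perm = [(a + k - i) % n for a in range(n)]
    col_perm = [(b + l - j) % n for b in range(n)]
    return matmul(matmul(row_perm_matrix(row_perm), A), col_perm_matrix(col_perm))
```

For example, `move_entry(A, (0, 0), (2, 1))[2][1]` equals `A[0][0]` for any square A. Cyclic shifts are just one convenient choice of control inputs; any pair of permutations sending row i to row k and column j to column l would do.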

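10.25 The matrix inversion function $f^{(n)}_{A^{-1}} : \mathcal{R}^{n^2} \mapsto \mathcal{R}^{n^2}$ maps a non-singular $n \times n$ matrix
over the ring $\mathcal{R}$ to its inverse. (See Section 6.3.) Show that $f^{(n)}_{A^{-1}}$ is $(2, n^2, n, n/2)$-independent.

Hint: Show that $f^{(n)}_{A^{-1}}$ contains as a subfunction the function $f^{(n)}_{PAQ} : \mathcal{R}^{3n^2} \mapsto \mathcal{R}^{n^2}$
defined in Problem 10.24. In this connection consider the following identity, which
holds when the $n \times n$ matrices R and S are non-singular:

$$\begin{bmatrix} R & A \\ 0 & S \end{bmatrix}^{-1} = \begin{bmatrix} R^{-1} & -R^{-1} A S^{-1} \\ 0 & S^{-1} \end{bmatrix}$$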
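
The block identity used in the hint, namely that the inverse of the block matrix with rows (R, A) and (0, S) has rows ($R^{-1}$, $-R^{-1} A S^{-1}$) and (0, $S^{-1}$), can be verified on small examples. The sketch below does so for the special case in which R and S are permutation matrices, whose inverses are their transposes; that restriction, the integer entries, and every function name are assumptions made to keep the example short.

```python
def matmul(X, Y):
    """Product of matrices given as lists of rows (sizes must be compatible)."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def transpose(X):
    return [list(col) for col in zip(*X)]


def block(top_left, top_right, bottom_left, bottom_right):
    """Assemble a 2 x 2 block matrix from four equally sized square blocks."""
    top = [a + b for a, b in zip(top_left, top_right)]
    bottom = [a + b for a, b in zip(bottom_left, bottom_right)]
    return top + bottom


def check_block_inverse(R, A, S):
    """Check the identity when R and S are permutation matrices, so that
    R^{-1} and S^{-1} are simply the transposes of R and S."""
    n = len(R)
    zero = [[0] * n for _ in range(n)]
    r_inv, s_inv = transpose(R), transpose(S)
    corner = [[-v for v in row] for row in matmul(matmul(r_inv, A), s_inv)]
    M = block(R, A, zero, S)
    M_inv = block(r_inv, corner, zero, s_inv)
    identity = [[1 if i == j else 0 for j in range(2 * n)] for i in range(2 * n)]
    return matmul(M, M_inv) == identity
```

With R and S permutation matrices, the upper-right block of the inverse is, up to sign, a product of the form PAQ, which is the connection to Problem 10.24 that the hint points to.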

PEBBLING  SUPERCONCENTRATORS 

10.26 Show that the graph consisting of two $n = 2^d$-input FFT graphs connected back
to back (as shown in Fig. 10.24 with the second FFT graph reversed) is a superconcentrator.
(Valiant [343] has shown the existence of n-superconcentrators with O(n)
vertices.)

Hint:  Reason  that  there  are  unique  vertex-disjoint  paths  from  any  r  input  vertices  of 
this  graph  to  any  r  consecutive  vertices  that  are  simultaneously  outputs  of  the  first 
FFT  graph  and  the  inputs  to  the  reversed  FFT  graph.  The  first  and  last  vertices  are 
consecutive. 

10.27 Prove that to pebble any S + 1 outputs of an n-superconcentrator, $S + 1 \leq n$, from an
initial placement of S pebbles requires that at least n - S different inputs be pebbled.
Hint: Suppose that at most n - (S + 1) inputs are pebbled from an initial placement
of S pebbles to pebble S + 1 outputs. Can you reason from the superconcentration
property that S + 1 or more inputs cannot remain unpebbled since S + 1 outputs are
pebbled?

Figure 10.24 Two back-to-back FFT graphs form a superconcentrator.

10.28 Use the result of the previous problem to show that to pebble an n-superconcentrator
with  S  pebbles  in  time  T  requires  S  and  T  to  satisfy  the  following  inequality: 

n2 

(S  +  l)T  >  — 

Hint:  As  in  the  proof  of  Theorem  10.4.1,  divide  time  up  into  consecutive  intervals. 
Choose  the  intervals  so  that  each  has  the  same  number  of  outputs  pebbled  during  it. 
Apply  the  results  of  the  previous  problem  to  obtain  a  lower  bound  on  the  sum  of  the 
number  of  input  and  output  vertices  that  are  pebbled  during  the  interval. 

10.29  Show  that  the  pebbling  of  two  n-input  back-to-back  FFT  graphs  requires  space  and 

time  that  satisfy  S2T  =  and  that  this  lower  bound  can  be  achieved  up  to  a 

multiplicative  factor. 

Hint: From the proof of Lemma 10.5.4 it follows that to pebble any 2S outputs with
S  pebbles  at  least  n  —  S  +  1  inputs  must  be  pebbled  because  if  fewer  inputs  need  be 
pebbled  the  outputs  can  have  more  values  than  is  possible  for  the  FFT. 


APPLICATIONS  OF  THE  GRIGORIEV  LOWER  BOUND 

10.30 Show that there is a pebbling for a straight-line program for the cyclic shift function
$f^{(n)}_{\text{cyclic}} : \mathcal{B}^{n + \lceil \log n \rceil} \mapsto \mathcal{B}^n$ examined in Section 10.5.2 for which $(S + 1)T =
O(n^2 \log n)$.

Hint: Pebble the graph of the circuit described in Section 2.5.1. Construct a circuit for
$f^{(n)}_{\text{cyclic}}$ that produces each output with $O(n \log n)$ gates.
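
For readers who want a concrete picture of the cyclic shift function, the sketch below applies the shift in $\lceil \log_2 n \rceil$ stages, one per control bit, in the spirit of a logarithmic shifter. It is an assumption that this matches the circuit of Section 2.5.1; the shift direction, the little-endian ordering of control bits, and the function names are all choices made for this illustration only.

```python
def cyclic_shift(x, control):
    """Cyclically shift the list x by the amount encoded in the control bits.

    control[b] is the bit of weight 2^b (little-endian, an assumption).
    The result y satisfies y[(i + shift) % n] == x[i].
    """
    n = len(x)
    y = list(x)
    for b, bit in enumerate(control):
        if bit:
            s = (1 << b) % n
            # One stage: every value moves s positions to the right, cyclically.
            y = [y[(i - s) % n] for i in range(n)]
    return y


def can_route(n, i, j):
    """Find control bits that send input position i to output position j and
    confirm that the shift does so."""
    bits = (n - 1).bit_length()              # ceil(log2 n) control bits
    shift = (j - i) % n
    control = [(shift >> b) & 1 for b in range(bits)]
    x = list(range(n))                       # distinct values identify positions
    return cyclic_shift(x, control)[j] == x[i]
```

Because `can_route(n, i, j)` succeeds for every pair of positions, the cyclic shifts move any input to any output, which is the first condition in the definition of a transitive function.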

10.31 Show that the binary addition function $f^{(n)}_{\text{add}}$ (see Section 2.7) can be realized by a
straight-line  program  using  space  and  time  satisfying  ST  =  0(n). 

10.32 Derive upper and lower bounds on the product (S + 1)T for pebblings of circuits for
the squaring function $f^{(n)}_{\text{square}}$ that are within a factor of $O(\log^2 n)$ of one another.

10.33 Derive good upper and lower bounds on the product (S + 1)T for pebblings of circuits
for the reciprocal function.

10.34 In Section 6.5.3 a straight-line algorithm is given to invert an $n \times n$ triangular matrix.
Construct another straight-line algorithm based on it that can be pebbled with O(n)
pebbles to produce outputs by columns in $O(n^3)$ steps under the assumption that the
standard matrix multiplication algorithm is used for the matrix multiplication steps.
Hint: To produce outputs of a triangular matrix T by columns using the algorithm of
Fig. 6.5, it is necessary to read the elements of $T_{11}^{-1}$ by rows and produce the outputs of
$T_{22}^{-1}$ by rows. Consider modifying this algorithm to generate the elements of the latter
matrix first by rows and then by columns.
BRANCHING PROGRAMS

10.35  Give  a  proof  of  Lemma  10.9.1  by  a)  designing  a  general  branching  program  to  simulate 
a  comparison  operator  and  b)  using  this  design  in  a  complete  branching  program  that 
simulates  a  decision  branching  program. 

10.36  In  Section  10.9  a  procedure  is  given  to  convert  a  general  branching  program  to  a  tree 
program  without  increasing  the  length  of  any  path.  Use  this  fact  to  show  that  every 
decision  branching  program  with  queries  {<,=}  that  sorts  a  list  of  n  items  requires 
worst-case time of at least $(n/2) \log(n/2)$ when n is even. Show that this lower bound
can  be  achieved  up  to  a  constant  multiplicative  factor. 

Hint:  Show  that  every  binary  tree  with  m  leaves  must  have  a  longest  path  of  length 
at  least  log2  m  and  determine  the  number  of  distinct  leaves  necessary  in  every  decision 
branching  program  for  sorting. 
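
The counting argument in the hint can be tried on small cases. Assuming the tree is binary and that a sorting program needs at least n! distinct leaves (one per ordering of n distinct items, which is the fact the hint asks you to establish), the sketch below computes the smallest possible worst-case depth and compares it with $(n/2) \log_2(n/2)$. The function names are hypothetical.

```python
from math import factorial, log2


def ceil_log2(m):
    """Exact ceiling of log2(m) for an integer m >= 1."""
    return (m - 1).bit_length()


def min_sorting_depth(n):
    """Smallest possible longest-path length of a binary tree with n! leaves."""
    return ceil_log2(factorial(n))


def bound_holds(n):
    """Check the claimed bound (n/2) log2(n/2) for an even n >= 2."""
    return min_sorting_depth(n) >= (n / 2) * log2(n / 2)
```

The bound holds because the largest n/2 factors of n! are each at least n/2, so $n! \geq (n/2)^{n/2}$.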

THE BORODIN-COOK LOWER-BOUND METHOD

10.37  The  computation  time  of  a  branching  program  is  the  length  of  the  longest  path  in  its 
directed  acyclic  multigraph.  Assume  that  a  probability  is  assigned  to  each  input  x  of 
length  n.  The  average  computation  time,  T,  of  a  branching  program  is  the  sum  of 
the  lengths  of  the  paths  associated  with  different  inputs  weighted  by  the  probabilities  of 
these  inputs.  To  compute  the  average  space  of  a  branching  program  with  k  vertices,  the 
integers  in  the  set  { 1,  2, . . . ,  k}  are  assigned  to  the  vertices  of  the  branching  program. 
The  space  associated  with  input  x  is  the  base-2  logarithm  of  the  largest  such  integer 
encountered  during  the  computation  associated  with  x.  The  average  space  associated 
with  a  numbering  of  vertices  is  the  average  of  this  logarithm.  The  average  space,  S, 
associated  with  a  branching  program  is  the  smallest  average  space  over  all  numberings 
of  vertices. 

Given a probability distribution on inputs of length n, let $C_f(a, b)$ denote the maximum
over all those tree branching programs of depth a of the probability that b of the
m outputs of the function f are computed correctly. Show that Theorem 10.11.1 can
be generalized to the above probabilistic setting.

Hint:  If  T  is  the  average  time  of  the  branching  program  P,  truncate  the  branching 
program  at  depth  2T,  call  the  new  program  P* ,  and  show  that  P*  solves  the  problem 
solved  by  P  with  probability  at  least  1/2.  Also,  show  that  with  probability  at  least  1/2 
there  exists  a  rich  path  in  some  stage  that  produces  b  =  \m/ cf\  outputs.  Let  pi  be 
the  probability  that  the  subtree  with  root  i  in  some  stage  correctly  produces  b  outputs. 
Now  develop  an  upper  bound  in  terms  of  the  pi  on  the  probability  that  some  tree  in 
some  stage  correctly  produces  b  outputs. 

APPLICATIONS OF THE BORODIN-COOK LOWER BOUND

10.38  Show  that  the  branching  program  in  Fig.  10.20  computes  the  inner  product  of  two  3- 
element  sequences  over  the  set  of  integers  modulo-2;  that  is,  the  integers  {0,  1}  with 
the  EXCLUSIVE-OR  function  for  addition  and  the  AND  function  for  multiplication. 

10.39  Complete  the  proof  of  Theorem  10.13.2  by  filling  in  the  details  of  the  construction  of 
a  branching  program  for  integer  multiplication  for  the  middle  range  of  space. 
10.40 Complete the proof of Theorem 10.13.4 by showing that two $n \times n$ matrices can
be multiplied with a hybrid algorithm that combines table lookup with the standard
matrix multiplication algorithm on $k \times k$ blocks to achieve space and time satisfying

$ST^2 = O(n^3 \log |\mathcal{R}|)$

10.41  Show  that  the  RAM  program  described  in  Fig.  10.22  can  be  converted  to  a  branching 
program  of  space  O(S)  and  time  0{T). 

Chapter  Notes 

The  first  formal  study  of  space-time  tradeoffs  was  made  by  Cobham  [73].  He  considered 
computations  on  one-tape  Turing  machines  using  as  a  space  measure  the  logarithm  of  the 
number  of  configurations,  and  obtained  quadratic  lower  bounds  on  the  space-time  product  to 
recognize  strings  representing  palindromes  and  perfect  squares. 

The pebble-game model was implicitly used by Paterson and Hewitt [239] to study program
schemas, uninterpreted graphs representing programs. They derived the space lower
bound of Lemma 10.2.1, thereby demonstrating that recursive programs are more powerful
than nonrecursive ones. Cook [75,79] asked how much space (how many pebbles) was
needed to execute a program schema with n vertices and obtained the result for pyramids of
Lemma 10.2.2, showing that the minimum space is at least $\Omega(\sqrt{n})$ for some schemas. The
minimum-space question was answered by Hopcroft, Paul, and Valiant [140], who proved
Theorem 10.7.1, and Paul, Tarjan, and Celoni [246], who obtained Theorem 10.8.1. The
pebble model first formally appeared in [140]. Gilbert, Lengauer, and Tarjan [115] and Loui
[205] have shown that the languages associated with minimal pebblings of DAGs (described
at the end of Section 10.2) are PSPACE-complete.

In  addition  to  studying  the  minimum  space  needed  for  a  computation,  researchers  also 
examined  tradeoffs  between  space  and  time.  Paterson  and  Hewitt  [239]  studied  the  conversion 
of  a  linear  recursive  program  schema  into  a  non-recursive  one  and  demonstrated  that  the  time 
needed satisfies $T = \Omega(n^{1+1/(S-1)})$ for $S \geq 2$. (See Chandra [66] and Swamy and Savage
[321] for more details on this problem.)

A  number  of  other  authors  have  identified  graphs  exhibiting  non-trivial  exchanges  of  space 
for time. Pippenger [254] gave a graph on n vertices for which $T = \Omega(n \log\log n)$ when
$S = O(n/\log n)$, and Savage and Swamy [293] demonstrated that the FFT graph requires S
and T satisfying $ST = \Theta(n^2)$. (This is the first tradeoff result for a natural algorithm. Their
upper bound is given in Theorem 10.5.5.) Later Tompa [333] and Reischuk [279] exhibited
graphs requiring $T = \Omega(n \log n)$ and $T = \Omega(n \log^t n)$ for any integer t, respectively, when
$S = O(n/\log n)$.

Paul and Tarjan [245], Lingas [201], and van Emde Boas and van Leeuwen [349] gave
graphs with T increasing from $O(n)$ to $T = 2^{\Omega(n^{1/2})}$, $T = 2^{\Omega(n^{1/3})}$, and $T = 2^{\Omega(n^{1/4} \log n)}$,
respectively, when S drops by a constant amount from $S = O(n^{1/2})$, $S = O(n^{1/3})$, and
$S = O(n^{1/4})$, respectively. Theorem 10.3.1 is from [349], as is Problem 10.14. Carlson
and Savage [64] took a different tack and exhibited graphs for which T is superlinear,
namely, $T = 2^{\Omega(\log n \log\log n)}$ over a range of values of S, namely, $\Omega(\log n) \leq S \leq
O(n^{1/2}/\log n)$. References to the worst-case exchange of space for time are given in
Section 10.6.
Grigoriev [121] gave the first space-time lower bounds that apply to all graphs for a problem
(see Corollary 10.4.1), the essential idea of which is generalized in Theorem 10.4.1. Savage
[291] introduced the w(u, v)-flow measure used in this version of a theorem to derive lower
bounds on area-time tradeoffs for VLSI algorithms. Grigoriev [121] also established
Theorem 10.4.2 and derived a tradeoff lower bound on polynomial multiplication that is
equivalent to Theorem 10.5.1 on convolution. The improved version of Theorem 10.4.2, namely
Theorem 10.5.4, is original with this book.

Lower  bounds  using  the  Grigoriev  approach  explicitly  require  that  the  sets  over  which 
functions  are  defined  be  finite.  Tompa  [331,332]  eliminated  the  requirement  for  finite  sets  but 
required  instead  that  functions  be  linear.  Using  concentrator  properties  of  matrices  deduced 
by  Valiant  [343],  Tompa  derived  a  lower  bound  on  ST  for  superconcentrators  that  he  applied 
to  matrix-vector  multiplication  and  polynomial  multiplication.  He  developed  a  similar  lower 
bound  for  the  DFT.  (See  Abelson  [2]  for  a  generalization  of  some  of  these  results  to  continuous 
functions.)  The  lower  bound  of  Theorem  10.5.5  uses  Tompa’s  DFT  proof  but  does  not  require 
that  straight-line  programs  be  linear. 

The result on cyclic shift (Theorem 10.5.2) is due to Savage [292]. (This paper also generalizes
Grigoriev's model to I/O-oblivious FSMs, extends JaJa's [147] space-time lower bound
for matrix inversion, and derives space-time lower bounds for transitive functions and banded
matrices.) The result on integer multiplication (Theorem 10.5.3) is due to Savage and Swamy
[294]. In [331] Tompa also obtained Theorem 10.5.6 on merging. Transitive functions defined
in Problem 10.22 were introduced by Vuillemin [355].

In [333] Tompa examined the graph associated with the algorithm for transitive closure
based on successive squarings described in Section 6.4 and demonstrated that it can be pebbled
either in a polynomial number of steps or with small space, namely $O(\log^2 n)$, but not
both. Carlson [61] demonstrated that algorithms for convolution based on FFT graphs (see
Section 6.7.4) require that $T = \Theta(n^3/S^2 + n^2(\log n)/S)$, which doesn't come close to
matching the lower bound of Theorem 10.5.1. However, through the judicious replacement
of back-to-back FFT subgraphs in the standard convolution algorithm, Carlson [62] was able
to achieve the bounds $T = \Theta(n \log S + n^2(\log S)/S)$, which are optimal over all FFT-based
convolution algorithms and nearly as good as the $T = \Theta(n^2/S)$ bounds. (See also [63].)
Carlson and Savage [65] explored for a number of problems the size of the smallest graphs that
can be pebbled with a small number of pebbles and demonstrated a tradeoff between size and
space.

Pippenger  [251]  has  surveyed  many  of  the  results  described  above  as  well  as  those  on  the 
black-white  pebble  game  described  below. 

Several  extensions  of  the  pebble  game  have  been  developed.  One  of  these  is  the  red-blue 
pebble game discussed in Chapter 11 and its generalization, the memory hierarchy game.
Another  is  the  black-white  pebble  game  whose  rules  are  the  following:  a)  a  black  pebble  can  be 
placed  on  an  input  vertex  at  any  time  and  on  a  non-input  vertex  only  if  its  predecessors  carry 
pebbles,  whether  white  or  black;  b)  a  black  pebble  may  be  removed  at  any  time;  c)  a  white 
pebble  can  be  placed  on  a  vertex  at  any  time;  d)  a  white  pebble  can  be  removed  only  if  all  its 
predecessors  carry  pebbles.  The  placement  of  white  pebbles  models  a  non-deterministic  guess. 
The removal of a white pebble is allowed only when the guess has been verified. Questions
this  game  makes  possible  are  whether  the  minimum  space  required  for  a  graph  is  lower  with 
the  black-white  pebble  game  than  with  the  standard  game  and  whether  for  a  given  amount  of 
space,  the  time  required  is  lower.  The  black-white  game  was  introduced  by  Cook  and  Sethi 
[78], who showed that the minimum space for the pyramid graph is at least $\sqrt{N/2} - 1$. Meyer
auf der Heide [222] proved that this minimum space is at most $\lceil n/2 \rceil + 2$ and established in
general that any graph with minimum space n in the black-white game has minimum space at
most $(n^2 - n)/2 + 1$ in the standard game. The latter result is the pebbling analog of Savitch's
theorem (Theorem 8.5.5).
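
The four rules of the black-white pebble game stated above can be turned into a small move checker. The sketch below is one possible encoding, not a definitive one: the move names, the dictionary representation of the graph, and the restriction to at most one pebble per vertex are assumptions made for this illustration.

```python
def run_black_white(preds, moves):
    """Replay a sequence of black-white pebbling moves and return the largest
    number of pebbles on the graph at any time.

    preds maps each vertex to the list of its predecessors (input vertices map
    to an empty list). Each move is a pair (action, vertex) with action one of
    "place_black", "remove_black", "place_white", "remove_white".
    A ValueError is raised on an illegal move.
    """
    colour = {}                                   # vertex -> "black" or "white"
    max_used = 0
    for action, v in moves:
        # True when every predecessor carries a pebble of either colour;
        # vacuously true for input vertices.
        supported = all(p in colour for p in preds[v])
        if action == "place_black":               # rule (a)
            if v in colour or not supported:
                raise ValueError(f"illegal black placement on {v}")
            colour[v] = "black"
        elif action == "remove_black":            # rule (b)
            if colour.get(v) != "black":
                raise ValueError(f"no black pebble on {v}")
            del colour[v]
        elif action == "place_white":             # rule (c)
            if v in colour:
                raise ValueError(f"{v} already carries a pebble")
            colour[v] = "white"
        elif action == "remove_white":            # rule (d)
            if colour.get(v) != "white" or not supported:
                raise ValueError(f"illegal white removal on {v}")
            del colour[v]
        else:
            raise ValueError(f"unknown action {action}")
        max_used = max(max_used, len(colour))
    return max_used
```

For a graph with inputs `a` and `b` feeding a vertex `c`, a white pebble may be placed on `c` first as a guess and removed once both `a` and `b` carry pebbles; removing it earlier is rejected by the checker.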

Loui  [206]  and  Meyer  auf  der  Heide  [222]  have  shown  that  the  minimum  space  with  the 
black-white  game  is  at  least  one  half  that  for  the  standard  pebble  game  for  balanced  trees,  a 
result  extended  by  Lengauer  and  Tarjan  [196]  to  all  trees  and  then  by  Klawe  [167].  Wilber 
[363]  has  exhibited  an  infinite  family  of  graphs  for  which  the  black-white  minimum  space  is 
smaller  than  the  minimum  space  with  the  standard  game  by  more  than  a  constant  factor. 

All  of  the  pebble  games  mentioned  above  are  one-person  games;  that  is,  one  person  plays 
the  game.  A  two-person  game  introduced  by  Venkateswaran  and  Tompa  [352]  models  parallel 
complexity  classes.  Savage  and  Vitter  [296]  have  also  introduced  a  model  of  parallel  pebbling. 

Branching  programs  have  been  known  as  binary  decision  diagrams  for  at  least  30  years 
[15],  although  their  importance  to  CAD  was  recognized  only  in  the  last  10  or  12  years.  (See 
[60]).  Branching  programs  were  proposed  as  a  vehicle  for  studying  space-time  problems  by 
Pippenger  and  first  studied  by  Tompa  [331],  who  cites  Pippenger  for  Lemma  10.9.2.  Borodin, 
Fischer, Kirkpatrick, Lynch, and Tompa [55] derived a lower bound of $ST = \Omega(n^2)$ to
sort n items with decision branching programs. Borodin and Cook [53] formulated the same
problem in terms of the general branching programs of Section 10.9 and developed the general
framework used in Theorem 10.11.1.

Yesha [370] developed lower bounds on the space-time product with branching programs
for the discrete Fourier transform (see Theorem 10.13.7) and matrix multiplication over
restricted domains. Abrahamson [6] (see also [4]) derived the lower bound on $ST^2$ in
Theorem 10.13.4, thereby improving upon the matrix multiplication bound of Yesha. He also
extended the Borodin-Cook model to probabilistic branching programs (see Problem 10.37)
and derived the lower bound on ST for convolution (Theorem 10.13.1), integer multiplication
(Theorem 10.13.2), matrix-vector multiplication (Theorem 10.13.3), and matrix inversion
(Theorem 10.13.6). He also developed a lower bound of $\Omega(n^3)$ on ST to compute the
product PAQ of three $n \times n$ matrices, where P and Q are permutation matrices. Abrahamson
has also studied Boolean matrix multiplication in the general branching program model [5].
Beame [34] has obtained the result of Theorem 10.13.8 showing that the unique elements
problem requires that $ST = \Omega(n^2)$ for general branching programs, which implies the lower
bound on sorting stated in Theorem 10.13.9.

In  the  comparison-based  branching  program  model,  Borodin,  Fich,  Meyer  auf  der  Heide, 
Upfal, and Wigderson [54] derive the lower bound $ST = \Omega(n^{3/2} \sqrt{\log n})$ for the
element-distinctness problem on n inputs. For the same computational model, Yao [369] improved
this to $ST = \Omega(n^{2-\epsilon(n)})$, where $\epsilon(n)$ is a decreasing function of n.
Although serial programming languages assume that programs are written for the RAM model,
this  model  is  rarely  implemented  in  practice.  Instead,  the  random-access  memory  is  replaced 
with  a  hierarchy  of  memory  units  of  increasing  size,  decreasing  cost  per  bit,  and  increasing 
access  time.  In  this  chapter  we  study  the  conditions  on  the  size  and  speed  of  these  units  when 
a  CPU  and  a  memory  hierarchy  simulate  the  RAM  model.  The  design  of  memory  hierarchies 
is  a  topic  in  operating  systems. 

A  memory  hierarchy  typically  contains  the  local  registers  of  the  CPU  at  the  lowest  level  and 
may  contain  at  succeeding  levels  a  small,  very  fast,  local  random-access  memory  called  a  cache, 
a  slower  but  still  fast  random-access  memory,  and  a  large  but  slow  disk.  The  time  to  move  data 
between  levels  in  a  memory  hierarchy  is  typically  a  few  CPU  cycles  at  the  cache  level,  tens  of 
cycles  at  the  level  of  a  random-access  memory,  and  hundreds  of  thousands  of  cycles  at  the  disk 
level!  A  CPU  that  accesses  a  random-access  memory  on  every  CPU  cycle  may  run  at  about 
a  tenth  of  its  maximum  speed,  and  the  situation  can  be  dramatically  worse  if  the  CPU  must 
access  the  disk  frequently.  Thus  it  is  highly  desirable  to  understand  for  a  given  problem  how 
the  number  of  data  movements  between  levels  in  a  hierarchy  depends  on  the  storage  capacity 
of  each  memory  unit  in  that  hierarchy. 

In  this  chapter  we  study  tradeoffs  between  the  number  of  storage  locations  (space)  at  each 
memory-hierarchy  level  and  the  number  of  data  movements  (I/O  time)  between  levels.  Two 
closely  related  models  of  memory  hierarchies  are  used,  the  memory-hierarchy  pebble  game  and 
the  hierarchical  memory  model,  which  are  extensions  of  those  introduced  in  Chapter  10. 

In most of this chapter it is assumed not only that the user has control over the I/O algorithm
used for a problem but that the operating system does not interfere with the I/O operations
requested by the user. However, we also examine I/O performance when the operating
system, not the user, controls the sequence of memory accesses (Section 11.10). Competitive
analysis is used in this case to evaluate two-level LRU and FIFO memory-management
algorithms.

11.1 The Red-Blue Pebble Game

The red-blue pebble game models data movement between adjacent levels of a two-level memory
hierarchy. We begin with this model to fix ideas and then introduce the more general
memory-hierarchy  game.  Both  games  are  played  on  a  directed  acyclic  graph,  the  graph  of  a 
straight-line  program.  We  describe  the  game  and  then  give  its  rules. 

In  the  red-blue  game,  (hot)  red  pebbles  identify  values  held  in  a  fast  primary  memory 
whereas  (cold)  blue  pebbles  identify  values  held  in  a  secondary  memory.  The  values  identified 
with  the  pebbles  can  be  words  or  blocks  of  words,  such  as  the  pages  used  by  an  operating 
system.  Since  the  red-blue  pebble  game  is  used  to  study  the  number  of  I/O  operations  necessary 
for  a  problem,  the  number  of  red  pebbles  is  assumed  limited  and  the  number  of  blue  pebbles  is 
assumed  unlimited.  Before  the  game  starts,  blue  pebbles  reside  on  all  input  vertices.  The  goal 
is  to  place  a  blue  pebble  on  each  output  vertex,  that  is,  to  compute  the  values  associated  with 
these  vertices  and  place  them  in  long-term  storage.  These  assumptions  capture  the  idea  that 
data  resides  initially  in  the  most  remote  memory  unit  and  the  results  must  be  deposited  there. 

RED-BLUE  PEBBLE  GAME 

•  (Initialization)  A  blue  pebble  can  be  placed  on  an  input  vertex  at  any  time. 

•  (Computation Step) A red pebble can be placed on (or moved to) a vertex if all its immediate
predecessors carry red pebbles.

•  (Pebble  Deletion)  A  pebble  can  be  deleted  from  any  vertex  at  any  time. 

•  (Goal)  A  blue  pebble  must  reside  on  each  output  vertex  at  the  end  of  the  game. 

•  (Input  from  Blue  Level)  A  red  pebble  can  be  placed  on  any  vertex  carrying  a  blue  pebble. 

•  (Output  to  Blue  Level)  A  blue  pebble  can  be  placed  on  any  vertex  carrying  a  red  pebble. 

The first rule (initialization) models the retrieval of input data from the secondary memory.
The second rule (a computation step) is equivalent to requiring that all the arguments
on which a function depends reside in primary memory before the function can be computed.
This rule also allows a pebble to move (or slide) to a vertex from one of its predecessors, modeling
the use of a register as both the source and target of an operation. The third rule allows
pebble  deletion:  if  a  red  pebble  is  removed  from  a  vertex  that  later  needs  a  red  pebble,  it  must 
be  repebbled. 

The  fourth  rule  (the  goal)  models  the  placement  of  output  data  in  the  secondary  memory 
at  the  end  of  a  computation.  The  fifth  rule  allows  data  held  in  the  secondary  memory  to  be 
moved  back  to  the  primary  memory  (an  input  operation).  The  sixth  rule  allows  a  result  to 
be  copied  to  a  secondary  memory  of  unlimited  capacity  (an  output  operation).  Note  that  a 
result  may  be  in  both  memories  at  the  same  time. 

The  red-blue  pebble  game  is  a  direct  generalization  of  the  pebble  game  of  Section  10.1 
(which  we  call  the  red  pebble  game),  as  can  be  seen  by  restricting  the  sixth  rule  to  allow 
the  placement  of  blue  pebbles  only  on  vertices  that  are  output  vertices  of  the  DAG.  Under 
this  restriction  the  blue  level  cannot  be  used  for  intermediate  results  and  the  goal  of  the  game 
becomes  to  minimize  the  number  of  times  vertices  are  pebbled  with  red  pebbles,  since  the 
optimal  strategy  pebbles  each  output  vertex  once. 

A pebbling strategy $\mathcal{P}$ is the execution of the rules of the pebble game on the vertices of
a graph. We assign a step to each placement of a pebble, ignoring steps on which pebbles are
removed, and number the steps consecutively. The space used by a strategy $\mathcal{P}$ is defined as
the maximum number of red pebbles it uses. The I/O time, $T_2$, of $\mathcal{P}$ on the graph G is the
number of input and output (I/O) steps used by $\mathcal{P}$. The computation time, $T_1$, is the number
of computation steps of $\mathcal{P}$ on G. Note that time in the red pebble game is the time to place red
pebbles on input and internal vertices; in this chapter the former are called I/O operations.
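
The rules and the three cost measures just defined can be made concrete with a small simulator. The following is a minimal sketch, not part of the original text: the class name `RedBlueGame`, its method names, and the dictionary representation of the DAG (each vertex mapped to its list of immediate predecessors) are all assumptions made for illustration.

```python
class RedBlueGame:
    """Toy red-blue pebble game on a DAG given as {vertex: [immediate predecessors]}."""

    def __init__(self, preds, outputs, num_red):
        self.preds, self.outputs, self.num_red = preds, set(outputs), num_red
        self.red = set()
        # Blue pebbles reside on all input vertices before the game starts.
        self.blue = {v for v, p in preds.items() if not p}
        self.io_steps = 0    # T_2: input and output steps
        self.comp_steps = 0  # T_1: computation steps
        self.space = 0       # maximum number of red pebbles in use

    def _place_red(self, v, slide_from=None):
        new_red = (self.red - {slide_from}) | {v}
        if len(new_red) > self.num_red:
            raise ValueError("not enough red pebbles")
        self.red = new_red
        self.space = max(self.space, len(self.red))

    def compute(self, v, slide_from=None):
        """Computation step: every immediate predecessor must carry a red pebble."""
        if not self.preds[v] or any(u not in self.red for u in self.preds[v]):
            raise ValueError(f"predecessors of {v!r} are not all red")
        if slide_from is not None and slide_from not in self.preds[v]:
            raise ValueError("a pebble can slide only from a predecessor")
        self._place_red(v, slide_from)
        self.comp_steps += 1

    def read(self, v):
        """Input from the blue level: place a red pebble on a blue-pebbled vertex."""
        if v not in self.blue:
            raise ValueError(f"{v!r} carries no blue pebble")
        self._place_red(v)
        self.io_steps += 1

    def write(self, v):
        """Output to the blue level: place a blue pebble on a red-pebbled vertex."""
        if v not in self.red:
            raise ValueError(f"{v!r} carries no red pebble")
        self.blue.add(v)
        self.io_steps += 1

    def delete_red(self, v):
        """Pebble deletion costs no step."""
        self.red.discard(v)

    def done(self):
        """Goal: a blue pebble resides on each output vertex."""
        return self.outputs <= self.blue


# Pebble the three-vertex graph c = f(a, b) with two red pebbles.
game = RedBlueGame({"a": [], "b": [], "c": ["a", "b"]}, outputs=["c"], num_red=2)
game.read("a")
game.read("b")
game.compute("c", slide_from="a")
game.write("c")
print(game.io_steps, game.comp_steps, game.space, game.done())
```

In this toy run two inputs are read, one vertex is computed by sliding a pebble, and one output is written, so only placements are counted as steps, as in the definition above.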

Since accesses to secondary memory are assumed to require much more time than accesses
to primary memory, a minimal pebbling strategy, $\mathcal{P}_{\min}$, performs the minimal number of
I/O operations on a graph G for a given number of red pebbles and uses the smallest number
of red pebbles for a given I/O time. Furthermore, such a strategy also uses the smallest number
of computation steps among those meeting the other requirements. We denote by $T_1^{(2)}(S, G)$
and $T_2^{(2)}(S, G)$ the number of computation and I/O steps in a minimal pebbling of G in the
red-blue pebble game with S red pebbles.

The  minimum  number  of  red  pebbles  needed  to  play  the  red-blue  pebble  game  is  the 
maximum  number  of  predecessors  of  any  vertex.  This  follows  because  blue  pebbles  can  be  used 
to  hold  all  intermediate  results.  Thus,  in  the  FFT  graph  of  Fig.  11.1  only  two  red  pebbles  are 
needed,  since  one  of  them  can  be  slid  to  the  vertex  being  pebbled.  However,  if  the  minimum 
number of pebbles is used, many expensive I/O operations are necessary.
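
As a small illustration of this observation, the minimum number of red pebbles can be read directly off the graph. The sketch below assumes the DAG is given as a dictionary from each vertex to its list of immediate predecessors; the function name `min_red_pebbles` is hypothetical.

```python
def min_red_pebbles(preds):
    """Maximum in-degree of the DAG: enough red pebbles when one pebble may slide."""
    return max((len(p) for p in preds.values()), default=0)


# A two-input sub-FFT graph: both outputs depend on both inputs.
butterfly = {"x0": [], "x1": [], "y0": ["x0", "x1"], "y1": ["x0", "x1"]}
print(min_red_pebbles(butterfly))
```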

In Section 11.2 we generalize the red-blue pebble game to multiple levels and consider two
variants  of  the  model,  one  in  which  all  levels  including  the  highest  can  be  used  for  intermediate 
storage,  and  a  second  in  which  the  highest  level  cannot  be  used  for  intermediate  storage.  The 
second  model  (the  I/O-limited  game)  captures  aspects  of  the  red-blue  pebble  game  as  well  as 
the  red  pebble  game  of  Chapter  10. 

An  important  distinction  between  the  pebble  game  results  obtained  in  this  chapter  and 
those  in  Chapter  10  is  that  here  lower  bounds  are  generally  derived  for  particular  graphs, 
whereas  in  Chapter  10  they  are  obtained  for  all  graphs  of  a  problem. 

11.1.1 Playing the Red-Blue Pebble Game

The  rules  for  the  red-blue  pebble  game  are  illustrated  by  the  eight-input  FFT  graph  shown  in 
Fig. 11.1. If S = 3 red pebbles are available to pebble this graph (at least S = 4 pebbles are
needed in the one-pebble game), a pebbling strategy that keeps the number of I/O operations
small is based on the pebbling of sub-FFT graphs on two inputs. Three such sub-FFT subgraphs
are shown by heavy lines in Fig. 11.1, one at each level of the FFT graph. This pebbling
strategy  uses  three  red  pebbles  to  place  blue  pebbles  on  the  outputs  of  each  of  the  four  lowest- 
level  sub-FFT  graphs  on  two  inputs,  those  whose  outputs  are  second-level  vertices  of  the  full 
FFT  graph.  (Thus,  eight  blue  pebbles  are  used.)  Shown  on  a  second-level  sub-FFT  graph  are 
three  red  pebbles  at  the  time  when  a  pebble  has  just  been  placed  on  the  first  of  the  two  outputs 
of  this  sub-FFT  graph.  This  strategy  performs  two  I/O  operations  for  each  vertex  except  for 
input  and  output  vertices.  A  small  savings  is  possible  if,  after  pebbling  the  last  sub-FFT  graph 
at  one  level,  we  immediately  pebble  the  last  sub-FFT  graph  at  the  next  level. 
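
The strategy just described can be checked by counting steps. The following sketch is an illustration under stated assumptions, not the author's program: vertices of the FFT graph on $2^k$ inputs are named `(row, column)`, vertex `(r, i)` is assumed to depend on `(r-1, i)` and `(r-1, i XOR 2^(r-1))`, and the function name is hypothetical. It omits the small savings mentioned above.

```python
def fft_butterfly_pebbling(k):
    """Pebble the FFT graph on 2**k inputs one two-input sub-FFT graph at a time."""
    n = 1 << k
    blue = {(0, i) for i in range(n)}   # inputs start with blue pebbles
    red = set()
    io_steps = comp_steps = max_red = 0

    def preds(r, i):
        return [(r - 1, i), (r - 1, i ^ (1 << (r - 1)))]

    for r in range(1, k + 1):
        bit = 1 << (r - 1)
        for i in range(n):
            j = i ^ bit
            if i > j:
                continue                # each sub-FFT graph is handled once
            for v in ((r - 1, i), (r - 1, j)):
                assert v in blue        # input from the blue level
                red.add(v)
                io_steps += 1
            assert all(u in red for u in preds(r, i))
            red.add((r, i))             # third red pebble on the first output
            comp_steps += 1
            max_red = max(max_red, len(red))
            assert all(u in red for u in preds(r, j))
            red.remove((r - 1, i))      # slide a pebble to the second output
            red.add((r, j))
            comp_steps += 1
            for v in ((r, i), (r, j)):
                blue.add(v)             # output to the blue level
                io_steps += 1
            red.clear()                 # deletions are free
    outputs_done = all((k, i) in blue for i in range(n))
    return io_steps, comp_steps, max_red, outputs_done


print(fft_butterfly_pebbling(3))
```

For the eight-input graph this counts 8 reads of inputs, 8 writes of outputs, and two I/O operations for each of the 16 remaining vertices, while never using more than three red pebbles.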


11.1.2  Balanced  Computer  Systems 

A  balanced  computer  system  is  one  in  which  no  computational  unit  or  data  channel  becomes 
saturated  before  any  other.  The  results  in  this  chapter  can  be  used  to  analyze  balance.  To 
illustrate  this  point,  we  examine  a  serial  computer  system  consisting  of  a  CPU  with  a  random- 
access  memory  and  a  disk  storage  unit.  Such  a  system  is  balanced  for  a  particular  problem  if 
the  time  used  for  I/O  is  comparable  to  the  time  used  for  computation. 

As shown in Section 11.5.2, multiplying two n × n matrices with a variant of the classical
matrix multiplication algorithm requires a number of computations proportional to $n^3$ and a
number of I/O operations proportional to $n^3/\sqrt{S}$, where S is the number of red pebbles or
the capacity of the random-access memory. Let $t_0$ and $t_1$ be the times for one computation
and I/O operation, respectively. Then the system is balanced when $t_0 n^3 \approx t_1 n^3/\sqrt{S}$. Let the
computational and I/O capacities, $C_{\mathrm{comp}}$ and $C_{\mathrm{I/O}}$, be the rates at which the CPU and disk
can compute and exchange data, respectively; that is, $C_{\mathrm{comp}} = 1/t_0$ and $C_{\mathrm{I/O}} = 1/t_1$. Thus,
balance is achieved when the following condition holds:

$$\frac{C_{\mathrm{comp}}}{C_{\mathrm{I/O}}} \approx \sqrt{S}$$

From this condition we see that if through technological advance the ratio $C_{\mathrm{comp}}/C_{\mathrm{I/O}}$ increases
by a factor $\beta$, then for the system to be balanced the storage capacity of the system, S,
must increase by a factor $\beta^2$.

Hennessy  and  Patterson  [132,  p.  427]  observe  that  CPU  speed  is  increasing  between  50% 
and  100%  per  year  while  that  of  disks  is  increasing  at  a  steady  7%  per  year.  Thus,  if  the  ratio 
$C_{\mathrm{comp}}/C_{\mathrm{I/O}}$ for our simple computer system grows by a factor of 50/7 ≈ 7 per year, then
S  must  grow  by  about  a  factor  of  49  per  year  to  maintain  balance.  To  the  extent  that  matrix 
multiplication  is  typical  of  the  type  of  computing  to  be  done  and  that  computers  have  two- 
level  memories,  a  crisis  is  looming  in  the  computer  industry!  Fortunately,  multi-level  memory 
hierarchies  are  being  introduced  to  help  avoid  this  crisis. 
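
The arithmetic behind the factor of 49 can be spelled out in a few lines. This is only a sketch of the balance condition for matrix multiplication, in which the ratio of computational to I/O capacity is taken to be approximately the square root of S; the function name is an assumption.

```python
def matmul_storage_growth(ratio_growth):
    """Factor by which S must grow when C_comp / C_IO grows by ratio_growth.

    Balance requires C_comp / C_IO to be about sqrt(S), so if the ratio is
    multiplied by beta, sqrt(S) must be too, and S is multiplied by beta**2.
    """
    return ratio_growth ** 2


print(matmul_storage_growth(7))
```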

As bad as the situation is for matrix multiplication, it is much worse for the Fourier transform
and sorting. For each of these problems the number of computation and I/O operations
is proportional to $n \log_2 n$ and $n \log_2 n/\log_2 S$, respectively (see Section 11.5.3). Thus, balance
is achieved when

$$\frac{C_{\mathrm{comp}}}{C_{\mathrm{I/O}}} \approx \log_2 S$$

Consequently, if $C_{\mathrm{comp}}/C_{\mathrm{I/O}}$ increases by a factor $\beta$, S must increase to $S^\beta$. Under the
conditions given above, namely, $\beta \approx 7$, a balanced two-level memory-hierarchy system for
these problems must have a storage capacity that grows from S to about $S^7$ every year.
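
The step from the balance condition to the growth from $S$ to $S^\beta$ can be written out as follows. This is a short worked derivation added for illustration; $S'$ denotes the storage capacity needed after the ratio has grown, a symbol not used in the original text.

```latex
\begin{align*}
\log_2 S  &\approx \frac{C_{\mathrm{comp}}}{C_{\mathrm{I/O}}}
  && \text{(balance before the change)} \\
\log_2 S' &\approx \beta \, \frac{C_{\mathrm{comp}}}{C_{\mathrm{I/O}}} \approx \beta \log_2 S
  && \text{(balance after the ratio grows by } \beta\text{)} \\
S' &\approx 2^{\beta \log_2 S} = S^{\beta}
\end{align*}
```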


11.2  The  Memory-Hierarchy  Pebble  Game 

The standard memory-hierarchy game (MHG) defined below generalizes the two-level red-blue
game to multiple levels. The L-level MHG is played on directed acyclic graphs with $p_l$
pebbles at level $l$, $1 \le l \le L-1$, and an unlimited number of pebbles at level L. When
L = 2, the lower level is the red level and the higher is the blue level. The number of pebbles
used at the L − 1 lowest levels is recorded in the resource vector $p = (p_1, p_2, \ldots, p_{L-1})$,
where $p_j \ge 1$ for $1 \le j \le L-1$. The rules of the game are given below.

STANDARD  MEMORY-HIERARCHY  GAME 

R1. (Initialization) A level-L pebble can be placed on an input vertex at any time.

R2. (Computation Step) A first-level pebble can be placed on (or moved to) a vertex if all its
immediate predecessors carry first-level pebbles.

R3. (Pebble Deletion) A pebble of any level can be deleted from any vertex.

R4. (Goal) A level-L pebble must reside on each output vertex at the end of the game.

R5. (Input from Level $l$) For $2 \le l \le L$, a level-$(l-1)$ pebble can be placed on any vertex
carrying a level-$l$ pebble.

R6. (Output to Level $l$) For $2 \le l \le L$, a level-$l$ pebble can be placed on any vertex carrying a
level-$(l-1)$ pebble.

The first four rules are exactly as in the red-blue pebble game. The fifth and sixth rules generalize
the fifth and sixth rules of the red-blue pebble game by identifying inputs from and outputs
to level-$l$ memory. These last two rules allow a level-$l$ memory to serve as temporary storage
for lower-level memories.

In  the  standard  MHG,  the  highest-level  memory  can  be  used  for  storing  intermediate 
results.  An  important  variant  of  the  MHG  is  the  I/O-limited  memory-hierarchy  game,  in 
which  the  highest  level  memory  cannot  be  used  for  intermediate  storage.  The  rules  of  this 
game  are  the  same  as  in  the  MHG  except  that  rule  R6  is  replaced  by  the  following  two  rules: 

I/O-LIMITED  MEMORY-HIERARCHY  GAME 

R6. (Output to Level $l$) For $2 \le l \le L-1$, a level-$l$ pebble can be placed on any vertex
carrying a level-$(l-1)$ pebble.

R7. (I/O Limitation) Level-L pebbles can only be placed on output vertices carrying
level-$(L-1)$ pebbles.

The sixth and seventh rules of the new game allow the placement of level-L pebbles only on
output  vertices.  The  two-level  version  of  the  I/O-limited  MHG  is  the  one-pebble  game  studied 
in  Chapter  10.  As  mentioned  earlier,  we  call  the  two-level  I/O-limited  MHG  the  red  pebble 
game  to  distinguish  it  from  the  red-blue  pebble  game  and  the  MHG.  Clearly  the  multi-level 
I/O-limited  MHG  is  a  generalization  of  both  the  standard  MHG  and  the  one-pebble  game. 

The  I/O-limited  MHG  models  the  case  in  which  accesses  to  the  highest  level  memory  take 
so  long  that  it  should  be  used  only  for  archival  storage,  not  intermediate  storage.  Today  disks 
are  so  much  slower  than  the  other  memories  in  a  hierarchy  that  the  I/O-limited  MHG  is  the 
appropriate  model  when  disks  are  used  at  the  highest  level. 

The resource vector $p = (p_1, p_2, \ldots, p_{L-1})$ associated with a pebbling strategy $\mathcal{P}$ specifies
the number of $l$-level pebbles, $p_l$, used by $\mathcal{P}$. We say that $p_l$ is the space used at level $l$ by
$\mathcal{P}$. We assume that $p_l \ge 1$ for $1 \le l \le L$, so that swapping between levels is possible. The
I/O time at level $l$ with pebbling strategy $\mathcal{P}$ and resource vector $p$, $T_l^{(L)}(p, G, \mathcal{P})$, $2 \le l \le L$,
with both versions of the MHG is the number of inputs from and outputs to level $l$. The computation
time with pebbling strategy $\mathcal{P}$ and resource vector $p$, $T_1^{(L)}(p, G, \mathcal{P})$, in the MHG
is the number of times first-level pebbles are placed on vertices by $\mathcal{P}$. Since there is little risk of
confusion, we use the same notation, $T_l^{(L)}(p, G, \mathcal{P})$, in the standard and I/O-limited MHG
for the number of computation and I/O steps.

The  definition  of  a  minimal  MHG  pebbling  is  similar  to  that  for  a  red-blue  pebbling. 
Given a resource vector $p$, $\mathcal{P}_{\min}$ is a minimal pebbling for an L-level MHG if it minimizes
the I/O time at level L, after which it minimizes the I/O time at level L − 1, continuing in
this  fashion  down  to  level  2.  Among  these  strategies  it  must  also  minimize  the  computation 
time.  This  definition  of  minimality  is  used  because  we  assume  that  the  time  needed  to  move 
data  between  levels  of  a  memory  hierarchy  grows  rapidly  enough  with  increasing  level  that  it  is 
less  costly  to  repebble  vertices  at  or  below  a  given  level  than  to  perform  an  I/O  operation  at  a 
higher  level. 
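
This definition amounts to a lexicographic ordering of cost vectors, which can be sketched in a few lines. The record layout below (a dictionary of level-$l$ I/O times plus a computation time) and the candidate figures are hypothetical, chosen only to show the order in which the costs are compared.

```python
def cost_key(io_times, comp_time):
    """Order costs as (T_L, T_{L-1}, ..., T_2, T_1), highest level first.

    io_times maps each level l = 2, ..., L to the I/O time at that level.
    """
    by_level = tuple(io_times[l] for l in sorted(io_times, reverse=True))
    return by_level + (comp_time,)


def minimal_pebbling(candidates):
    """Pick the candidate whose cost vector is lexicographically smallest."""
    return min(candidates, key=lambda c: cost_key(c["io"], c["comp"]))


# Hypothetical strategies for a three-level game (made-up numbers).
candidates = [
    {"name": "A", "io": {2: 10, 3: 4}, "comp": 20},
    {"name": "B", "io": {2: 6, 3: 5}, "comp": 12},
    {"name": "C", "io": {2: 12, 3: 4}, "comp": 15},
]
print(minimal_pebbling(candidates)["name"])
```

Strategy B has the smallest level-2 I/O time and computation time, yet it is not selected because it performs more I/O operations at the highest level, which is compared first.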


Figure 11.2 Pebbling an eight-input FFT graph in the three-level MHG.

11.2.1 Playing the MHG

Figure  11.2  shows  the  FFT  graph  on  eight  inputs  being  pebbled  in  a  three-level  MHG  with 
resource vector p = (2, 4). Here black circles denote first-level pebbles, shaded circles denote
second-level  pebbles  and  striped  circles  denote  third-level  pebbles.  Four  striped,  three  shaded 
and  two  black  pebbles  reside  on  vertices  in  the  second  row  of  the  FFT.  One  of  these  shaded 
second-level  pebbles  shares  a  vertex  with  a  black  first-level  pebble,  so  that  this  black  pebble  can 
be  moved  to  the  vertex  covered  by  the  open  circle  without  deleting  all  pebbles  on  the  doubly 
covered  vertex. 

To  pebble  the  vertex  under  the  open  square  with  a  black  pebble,  we  reuse  the  black  pebble 
on  the  open  circle  by  swapping  it  with  a  fourth  shaded  pebble,  after  which  we  place  the  black 
pebble  on  the  vertex  that  was  doubly  covered  and  then  slide  it  to  the  vertex  covered  by  the 
open box. This graph can be completely pebbled with the resource vector p = (2, 4) using
only four third-level pebbles, as the reader is asked to show. (See Problem 11.3.) Thus, it can
also be pebbled in the four-level I/O-limited MHG using resource vector p = (2, 4, 4).

11.3  I/O-Time  Relationships 

The  following  simple  relationships  follow  from  two  observations.  First,  each  input  and  output 
vertex  must  receive  a  pebble  at  each  level,  since  every  input  must  be  read  from  level  L  and 
every  output  must  be  written  to  level  L.  Second,  at  least  one  computation  step  is  needed  for 
each  non-input  vertex  of  the  graph.  Here  we  assume  that  every  vertex  in  V  must  be  pebbled 
to  pebble  the  output  vertices. 

LEMMA 11.3.1 Let $a$ be the maximum in-degree of any vertex in $G = (V, E)$ and let $In(G)$
and $Out(G)$ be the sets of input and output vertices of G, respectively. Then any pebbling $\mathcal{P}$ of G
with the MHG, whether standard or I/O-limited, satisfies the following conditions for $2 \le l \le L$:

$$T_l^{(L)}(p, G, \mathcal{P}) \ge |In(G)| + |Out(G)|$$
$$T_1^{(L)}(p, G, \mathcal{P}) \ge |V| - |In(G)|$$

The  following  theorem  relates  the  number  of  moves  in  an  L-level  game  to  the  number  in 
a two-level game and allows us to use prior results. The lower bound on the level-$l$ I/O time
is stated in terms of $s_{l-1}$ because pebbles at levels $1, 2, \ldots, l-1$ are treated collectively as red
pebbles to derive a lower bound; pebbles at level $l$ and above are treated as blue pebbles.

THEOREM 11.3.1 Let $s_l = \sum_{j=1}^{l} p_j$. Then the following inequalities hold for every L-level
standard MHG pebbling strategy $\mathcal{P}$ for G, where $p$ is the resource vector used by $\mathcal{P}$ and $T_1^{(2)}(S, G)$
and $T_2^{(2)}(S, G)$ are the number of computation and I/O operations used by a minimal pebbling in
the red-blue pebble game played on G with S red pebbles:

$$T_l^{(L)}(p, G, \mathcal{P}) \ge T_2^{(2)}(s_{l-1}, G) \quad \text{for } 2 \le l \le L$$

Also, the following lower bound on computation time holds for all pebbling strategies $\mathcal{P}$ in the
standard MHG:

$$T_1^{(L)}(p, G, \mathcal{P}) \ge T_1^{(2)}(s_1, G)$$

In the I/O-limited case the following lower bounds apply, where $a$ is the maximum fan-in of any
vertex of G:

$$T_l^{(L)}(p, G, \mathcal{P}) \ge T_2^{(2)}(s_{l-1}, G) \quad \text{for } 2 \le l \le L$$
$$T_1^{(L)}(p, G, \mathcal{P}) \ge T_2^{(2)}(s_{L-1}, G)/a$$


Proof The first set of inequalities is shown by considering the red-blue game played with
$S = s_{l-1}$ red pebbles and an unlimited number of blue pebbles. The S red pebbles and
$s_{L-1} - S$ blue pebbles can be classified into L − 1 groups with $p_j$ pebbles in the $j$th
group, so that we can simulate the steps of an L-level MHG pebbling strategy $\mathcal{P}$. Because
there are constraints on the use of pebbles in $\mathcal{P}$, this strategy uses a number of level-$l$ I/O
operations that cannot be smaller than the minimum number of such I/O operations when
pebbles at level $l-1$ or less are treated as red pebbles and those at higher levels are treated
as blue pebbles. Thus, $T_l^{(L)}(p, G, \mathcal{P}) \ge T_2^{(2)}(s_{l-1}, G)$. By similar reasoning it follows that
$T_1^{(L)}(p, G, \mathcal{P}) \ge T_1^{(2)}(s_1, G)$.

In the above simulation, blue pebbles simulating levels $l$ and above cannot be used arbitrarily
when the I/O-limitation is imposed. To derive lower bounds under this limitation, we
classify $S = s_{L-1}$ pebbles into L − 1 groups with $p_j$ pebbles in the $j$th group and simulate
in the red-blue pebble game the steps of an L-level I/O-limited MHG pebbling strategy $\mathcal{P}$.
The I/O time at level $l$ is no less than the I/O time in the two-level I/O-limited red-blue
pebble game in which all S red pebbles are used at level $l-1$ or less.

Since the number of blue pebbles is unlimited, in a minimal pebbling all I/O operations
consist of placing red pebbles on blue-pebbled vertices. It follows that if T I/O operations
are performed on the input vertices, then at least T placements of red pebbles on blue-pebbled
vertices occur. Since at least one internal vertex must be pebbled with a red pebble
in a minimal pebbling for every $a$ input vertices that are red-pebbled, the computation time
is at least $T/a$. Specializing this to $T = T_2^{(2)}(s_{L-1}, G)$ for the I/O-limited MHG, we have
the last result. ■
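
To apply Theorem 11.3.1 one needs the partial sums $s_l$ of the resource vector. The sketch below computes them and selects the number of red pebbles $s_{l-1}$ used in the level-$l$ bound; the function names are assumptions, and the resource vector (2, 4) is the one used in the three-level example of Section 11.2.1.

```python
from itertools import accumulate


def partial_sums(p):
    """Return [s_1, ..., s_{L-1}] for the resource vector p = (p_1, ..., p_{L-1})."""
    return list(accumulate(p))


def red_pebbles_for_level(p, level):
    """Number of red pebbles s_{level-1} in the red-blue bound on level-`level` I/O time."""
    if not 2 <= level <= len(p) + 1:
        raise ValueError("level must satisfy 2 <= level <= L")
    return partial_sums(p)[level - 2]


p = (2, 4)                           # three-level game: L = 3
print(partial_sums(p))               # s_1 and s_2
print(red_pebbles_for_level(p, 3))   # pebbles at levels 1 and 2 act as red pebbles
```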

It is important to note that the lower bound to $T_1^{(2)}(S, G, \mathcal{P})$ for the I/O-limited case is
not stated in terms of |V|, because |V| may not be the same for each value of S. Consider the
multiplication of two n × n matrices. Every graph of the standard algorithm can be pebbled
with three red pebbles, but such graphs have about $2n^3$ vertices, a number that cannot be
reduced by more than a constant factor when a constant number of red pebbles is used. (See
Section 11.5.2.) On the other hand, using the graph of Strassen's algorithm for this problem
requires at least $\Omega(n^{0.38529})$ pebbles, since it has $O(n^{2.807})$ vertices.

We close this section by giving conditions under which lower bounds for one graph can
be used for another. Let a reduction of DAG $G_1 = (V_1, E_1)$ be a DAG $G_0 = (V_0, E_0)$,
$V_0 \subseteq V_1$ and $E_0 \subseteq E_1$, obtained by deleting edges from $E_1$ and coalescing the non-terminal
vertices on a "chain" of vertices in $V_1$ into the first vertex on the chain. A chain is a sequence
$v_1, v_2, \ldots, v_r$ of vertices such that, for $2 \le i \le r-1$, $v_i$ is adjacent to $v_{i-1}$ and $v_{i+1}$ and no
other vertices.

LEMMA 11.3.2 Let $G_0$ be a reduction of $G_1$. Then for any minimal pebbling $\mathcal{P}_{\min}$ and
$1 \le l \le L$, the following inequalities hold:

$$T_l^{(L)}(p, G_1) \ge T_l^{(L)}(p, G_0)$$


©John  E  Savage 



Proof Any minimal pebbling strategy for $G_1$ can be used to pebble $G_0$ by simulating moves
on a chain with pebble placements on the vertex to which vertices on the chain are coalesced
and by honoring the edge restrictions of $G_1$ that are removed to create $G_0$. Since this strategy
for $G_1$ may not be minimal for $G_0$, the inequalities follow. ■

11.4  The  Hong-Kung  Lower-Bound  Method 

In  this  section  we  derive  lower  limits  on  the  I/O  time  at  each  level  of  a  memory  hierarchy 
needed to pebble a directed acyclic graph with the MHG. These results are obtained by combining
the inequalities of Theorem 11.3.1 with a lower bound on the I/O and computation
time  for  the  red-blue  pebble  game. 

Theorem  10.4.1  provides  a  framework  that  can  be  used  to  derive  lower  bounds  on  the  I/O 
time  in  the  red-blue  pebble  game.  This  follows  because  the  lower  bounds  of  Theorem  10.4.1 
are stated in terms of $T_I$, the number of times inputs are pebbled with S red pebbles, which
is  also  the  number  of  I/O  operations  on  input  vertices  in  the  red-blue  pebble  game.  It  is 
important  to  note  that  the  lower  bounds  derived  using  this  framework  apply  to  every  straight- 
line  program  for  a  problem. 

In  some  cases,  for  example  matrix  multiplication,  these  lower  bounds  are  strong.  However, 
in other cases, notably the discrete Fourier transform, they are weak. For this reason we introduce
a way to derive lower bounds that applies to a particular graph of a problem. If that graph
is used for the problem, stronger lower bounds can be derived with this method than with the
techniques of Chapter 10. We begin by introducing the S-span of a DAG.

DEFINITION 11.4.1 Given a DAG G = (V, E), the S-span of G, $\rho(S, G)$, is the maximum
number of vertices of G that can be pebbled with S pebbles in the red pebble game maximized over
all initial placements of S red pebbles. (The initialization rule is disallowed.)
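
For very small graphs the S-span can be found by exhaustive search, which may help in checking one's understanding of the definition. The sketch below is an illustration only and rests on two assumptions that the definition leaves implicit: a vertex is counted once, when a pebble is placed on it by a computation step, and vertices without predecessors can never be pebbled after the initial placement. The function name `s_span` and the dictionary representation of the DAG are hypothetical, and the search takes time exponential in the size of the graph.

```python
from itertools import combinations


def s_span(preds, S):
    """Brute-force S-span of a small DAG given as {vertex: [immediate predecessors]}."""
    vertices = list(preds)
    best = 0
    for start in combinations(vertices, min(S, len(vertices))):
        first = (frozenset(start), frozenset())   # (pebbled now, ever computed)
        seen = {first}
        stack = [first]
        while stack:
            pebbled, computed = stack.pop()
            best = max(best, len(computed))
            for v in vertices:
                if v in pebbled or not preds[v]:
                    continue                      # no initialization rule
                if any(u not in pebbled for u in preds[v]):
                    continue
                # Either slide/reuse a pebble already on the graph ...
                targets = [pebbled - {u} for u in pebbled]
                # ... or use a free pebble if one is left.
                if len(pebbled) < S:
                    targets.append(pebbled)
                for rest in targets:
                    state = (rest | {v}, computed | {v})
                    if state not in seen:
                        seen.add(state)
                        stack.append(state)
    return best


chain = {"a": [], "b": ["a"], "c": ["b"]}
butterfly = {"x0": [], "x1": [], "y0": ["x0", "x1"], "y1": ["x0", "x1"]}
print(s_span(chain, 1), s_span(butterfly, 2), s_span(butterfly, 3))
```

On the two-input sub-FFT graph, two pebbles placed on the inputs allow only one output to be pebbled, because the slide removes a pebble that the other output needs; with three pebbles both outputs can be reached.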

The  following  is  a  slightly  weaker  but  simpler  version  of  the  Hong-Kung  [137]  lower 
bound on I/O time for the two-level MHG. This proof divides computation time into consecutive
intervals, just as was done for the space-time lower bounds in the proofs of Theorems
10.4.1 and 10.11.1.

THEOREM 11.4.1 For every pebbling $\mathcal{P}$ of the DAG G = (V, E) in the red-blue pebble game
with S red pebbles, the I/O time used, $T_2^{(2)}(S, G, \mathcal{P})$, satisfies the following lower bound:

$$\lceil T_2^{(2)}(S, G, \mathcal{P})/S \rceil \, \rho(2S, G) \ge |V| - |In(G)|$$

Proof Divide $\mathcal{P}$ into consecutive sequential sub-pebblings $\{\mathcal{P}_1, \mathcal{P}_2, \ldots, \mathcal{P}_h\}$, where each
sub-pebbling has S I/O operations except possibly the last, which has no more such operations.
Thus, $h = \lceil T_2^{(2)}(S, G, \mathcal{P})/S \rceil$.

We now develop an upper bound Q to the number of vertices of G pebbled with red
pebbles in any sub-pebbling $\mathcal{P}_j$. This number multiplied by the number h of sub-pebblings
is an upper bound to the number of vertices other than inputs, $|V| - |In(G)|$, that must be
pebbled to pebble G. It follows that

$$Qh \ge |V| - |In(G)|$$

The upper bound on Q is developed by adding S new red pebbles and showing that
we may use these new pebbles to move all I/O operations in a sub-pebbling $\mathcal{P}_t$ to either
the beginning or the end of the sub-pebbling without changing the number of computation
steps or I/O operations. Thus, without changing them, we move all computation steps to a
middle interval of $\mathcal{P}_t$, between the higher-level I/O operations.

We now show how this may be done. Consider a vertex v carrying a red pebble at some
time during $\mathcal{P}_t$ that is pebbled for the first time with a blue pebble during $\mathcal{P}_t$ (vertex 7 at
step 11 in Fig. 11.3). Instead of pebbling v with a blue pebble, use a new red pebble to
keep a red pebble on v. (This is equivalent to swapping the new and old red pebbles on v.)
This frees up the original red pebble to be used later in the sub-pebbling. Because we attach
a red pebble to v for the entire pebbling $\mathcal{P}_t$, all later output operations from v in $\mathcal{P}_t$ can
be deleted except for the last such operation, if any, which can be moved to the end of the
interval. Note that if after v is given a blue pebble in $\mathcal{P}_t$, it is later given a red pebble, this red
pebbling step and all subsequent blue pebbling steps except the last, if any, can be deleted.
These changes do not affect any computation step in $\mathcal{P}_t$.

Consider a vertex v carrying a blue pebble at the start of $\mathcal{P}_t$ that later in $\mathcal{P}_t$ is given a
red pebble (see vertex 4 at step 12 in Fig. 11.3). Consider the first pebbling of this kind.
The red pebble assigned to v may have been in use prior to its placement on v. If a new
red pebble is used for v, the first pebbling of v with a red pebble can be moved toward
the beginning of $\mathcal{P}_t$ so that, without violating the precedence conditions of G, it precedes
all placements of red pebbles on vertices without pebbles. Attach this new red pebble to v
during $\mathcal{P}_t$. Subsequent placements of red pebbles on v when it carries a blue pebble during
$\mathcal{P}_t$, if any, are thereby eliminated.

Figure 11.3 The vertices of an FFT graph are numbered and a pebbling schedule is given in
which the two numbered red pebbles are used. Up (down) arrows identify steps in which an
output (input) occurs; other steps are computation steps. Steps 10 through 13 of the schedule $\mathcal{P}_t$
contain two I/O operations. With two new red pebbles, the input at step 12 can be moved to the
beginning of the interval and the output at step 11 can be moved after step 13.

We now derive an upper bound to Q. At the start of the pebbling of the middle interval
of $\mathcal{P}_t$ there are at most 2S red pebbles on G, at most S original red pebbles plus S new red
pebbles. Clearly, the number of vertices that can be pebbled in the middle interval with first-level
pebbles is largest when all 2S red pebbles on G are allowed to move freely. It follows
that at most $\rho(2S, G)$ vertices can be pebbled with red pebbles in any interval. Since all
vertices must be pebbled with red pebbles, this completes the proof. ■
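
Theorem 11.4.1 bounds the ceiling of the I/O time divided by S, and it can be turned into an explicit integer bound: if the ceiling is at least k, the I/O time is at least S(k − 1) + 1. The sketch below carries out this step. The function name is hypothetical, and the span value used in the example is a made-up placeholder, not the actual span of any graph in the text.

```python
def hong_kung_io_lower_bound(num_vertices, num_inputs, S, span_2S):
    """Smallest integer T with ceil(T / S) * span_2S >= num_vertices - num_inputs.

    span_2S stands for the (2S)-span of the graph and must be supplied by the caller.
    """
    non_inputs = num_vertices - num_inputs
    k = -(-non_inputs // span_2S)        # integer ceiling of non_inputs / span_2S
    if k <= 0:
        return 0
    return S * (k - 1) + 1               # ceil(T / S) >= k  iff  T >= S * (k - 1) + 1


# 32 vertices, 8 inputs, S = 3 red pebbles, and a hypothetical span value of 8.
print(hong_kung_io_lower_bound(32, 8, 3, 8))
```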

Combining  Theorems  11.3.1  and  11.4.1  and  a  weak  lower  limit  on  the  size  of  X©'1  (p,  G), 
we  have  the  following  explicit  lower  bounds  to  XJ  L\p,  G). 


COROLLARY 11.4.1 In the standard MHG when $T_l^{(L)}(p, G) \ge \beta(s_{l-1} - 1)$ for $\beta > 1$, the
following inequality holds for $2 \le l \le L$:

$$T_l^{(L)}(p, G) \ge \frac{\beta}{\beta + 1} \, \frac{s_{l-1}}{\rho(2s_{l-1}, G)} \, (|V| - |In(G)|)$$

In the I/O-limited MHG when $T_l^{(L)}(p, G) \ge \beta(s_{l-1} - 1)$ for $\beta > 1$, the following
inequality holds for $2 \le l \le L$:

$$T_l^{(L)}(p, G) \ge \frac{\beta}{\beta + 1} \, \frac{s_{l-1}}{\rho(2s_{l-1}, G)} \, (|V| - |In(G)|)$$
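
As a small numerical illustration of Corollary 11.4.1, the following sketch evaluates the
right-hand side of the inequality with exact rational arithmetic. The function name
`io_lower_bound` and its parameter names are hypothetical and are not part of the text. The
argument `span` is assumed to be an upper bound on $\rho(2s_{l-1}, G)$ supplied by the caller;
a larger value only weakens the resulting bound.

```python
from fractions import Fraction


def io_lower_bound(beta, s_prev, span, num_vertices, num_inputs):
    """Evaluate (beta/(beta+1)) * (s_prev/span) * (|V| - |In(G)|) exactly.

    beta         -- the constant beta > 1 of the corollary
    s_prev       -- s_{l-1}, the number of pebbles at levels l-1 and below
    span         -- an upper bound on rho(2*s_prev, G), supplied by the caller
    num_vertices -- |V|
    num_inputs   -- |In(G)|
    """
    if beta <= 1 or s_prev < 1 or span < 1:
        raise ValueError("need beta > 1, s_prev >= 1 and span >= 1")
    beta = Fraction(beta)
    return beta / (beta + 1) * Fraction(s_prev, span) * (num_vertices - num_inputs)


# Illustrative values only: beta = 2, s_{l-1} = 8, span bound 64, |V| = 1000, |In(G)| = 100.
print(io_lower_bound(2, 8, 64, 1000, 100))  # 75
```

The bound applies only when the level-l I/O time is already at least $\beta(s_{l-1} - 1)$, so
the value returned is meaningful only under that hypothesis.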


11.5  Tradeoffs  Between  Space  and  I/O  Time 

We  now  apply  the  Hong-Kung  method  to  a  variety  of  important  problems  including  matrix- 
vector  multiplication,  matrix-matrix  multiplication,  the  fast  Fourier  transform,  convolution, 
and  merging  and  permutation  networks. 

11.5.1 Matrix-Vector Product

We examine here the matrix-vector product function, which maps $\mathcal{R}^{n^2+n}$ to $\mathcal{R}^n$
over a commutative ring $\mathcal{R}$ and is described in Section 6.2.1, primarily to illustrate the
development of efficient multi-level pebbling strategies. The lower bounds on I/O and
computation time for this problem are trivial to obtain. For the matrix-vector product, we
assume that the graphs used are those associated with inner products. The inner product
$u \cdot v$ of n-vectors u and v over a ring $\mathcal{R}$ is defined by:

$$u \cdot v = \sum_{i=1}^{n} u_i \cdot v_i$$

The graph of a straight-line program to compute this inner product is given in Fig. 11.4, where
the additions of products are formed from left to right.

The matrix-vector product is defined here as the pebbling of a collection of inner product
graphs. As suggested in Fig. 11.4, each inner product graph can be pebbled with three red
pebbles.

THEOREM 11.5.1 Let G be the graph of a straight-line program for the product of the matrix A
with the vector x. Let G be pebbled in the standard MHG with the resource vector p. There is a
pebbling strategy $\mathcal{P}$ of G with $p_l \ge 1$ for $2 \le l \le L - 1$ and $p_1 \ge 3$ such that
$T_1^{(L)}(p, G, \mathcal{P}) = 2n^2 - n$, the minimum value, and the following bounds hold
simultaneously:

$$n^2 + 2n \le T_l^{(L)}(p, G, \mathcal{P}) \le 2n^2 + n$$

Figure 11.4 The graph of an inner product computation showing the order in which vertices
are pebbled. Input vertices are labeled with the entries in the matrix A and vector x that are
combined. Open vertices are product vertices; those above them are addition vertices.

Proof The lower bound $T_l^{(L)}(p, G, \mathcal{P}) \ge n^2 + 2n$, $2 \le l \le L$, follows from Lemma 11.3.1
because there are $n^2 + n$ inputs and n outputs to the matrix-vector product. The upper
bounds derived below represent the number of operations performed by a pebbling strategy
that uses three level-1 pebbles and one pebble at each of the other levels.

Each of the n results of the matrix-vector product is computed as an inner product in
which successive products $a_{i,j} x_j$ are formed and added to a running sum, as suggested by
Fig. 11.4. Each of the $n^2$ entries of the matrix A (leaves of inner product trees) is used in
one inner product and is pebbled once at levels $L, L-1, \ldots, 1$ when needed. The n entries
in x are used in every inner product and are pebbled once at each level for each of the n
inner products. First-level pebbles are placed on each vertex of each inner product tree in the
order suggested in Fig. 11.4. After the root vertex of each tree is pebbled with a first-level
pebble, it is pebbled at levels $2, \ldots, L$.

It follows that one I/O operation is performed at each level on each vertex associated
with an entry in A and the outputs and that n I/O operations are performed at each level
on each vertex associated with an entry in x, for a total of $2n^2 + n$ I/O operations at each
level. This pebbling strategy places a first-level pebble once on each interior vertex of each
of the n inner product trees. Such trees have $2n - 1$ internal vertices. Thus, this strategy
takes $2n^2 - n$ computation steps. ■
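
The counts in this proof can be checked with a short simulation. The sketch below is only a
step counter for the strategy described above, not a full implementation of the memory
hierarchy game; the function name `pebble_matrix_vector` is hypothetical. It assumes that a
first-level pebble may slide from an input onto the vertex it feeds, which is how three
first-level pebbles suffice.

```python
def pebble_matrix_vector(n, levels):
    """Count the steps of the inner-product pebbling strategy for an n x n matrix.

    Returns (computation_steps, io_per_level, max_red), where io_per_level maps
    each level l in 2..levels to its number of I/O operations and max_red is the
    largest number of first-level pebbles in use at any time.
    """
    compute = 0
    io = {level: 0 for level in range(2, levels + 1)}
    max_red = 0
    for _row in range(n):
        red = 0                          # first-level pebbles on the current tree
        for j in range(n):
            # Bring a[row][j] and x[j] down through every level: one input each.
            for level in io:
                io[level] += 2
            red += 2
            max_red = max(max_red, red)
            compute += 1                 # product vertex: slide a pebble onto it
            red -= 1
            if j > 0:
                compute += 1             # addition vertex: slide onto the running sum
                red -= 1
        # Write the root (one output) up through every level.
        for level in io:
            io[level] += 1
    return compute, io, max_red


steps, io_counts, red_used = pebble_matrix_vector(4, 3)
print(steps, io_counts, red_used)  # 28 {2: 36, 3: 36} 3
```

For n = 4 the counter reports $2n^2 - n = 28$ computation steps and $2n^2 + n = 36$ I/O
operations at each level, in agreement with the theorem.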

As  the  above  results  demonstrate,  the  matrix-vector  product  is  an  example  of  an  I/O- 
bounded  problem,  a  problem  for  which  the  amount  of  I/O  required  at  each  level  in  the 
memory hierarchy is comparable to the number of computation steps. Returning to the
discussion in Section 11.1.2, we see that as CPU speed increases with technological advances, a
balanced  computer  system  can  be  constructed  for  this  problem  only  if  the  I/O  speed  increases 
proportionally  to  CPU  speed. 

The I/O-limited version of the MHG for the matrix-vector product is the same as the
standard version because only first-level pebbles are used on vertices that are neither input nor
output vertices.

11.5.2 Matrix-Matrix Multiplication

In  this  section  we  derive  upper  and  lower  bounds  on  exchanges  between  I/O  time  and  space 
for  the  n  x  n  matrix  multiplication  problem  in  the  standard  and  I/O-limited  MHG.  We  show 
that  the  lower  bounds  on  computation  and  I/O  time  can  be  matched  by  efficient  pebbling 
strategies. 

Lower bounds for the standard MHG are derived for the family $\mathcal{F}_n$ of inner product
graphs for $n \times n$ matrix multiplication, namely, the set of graphs to multiply two $n \times n$
matrices using just inner products to compute entries in the product matrix. (See Section 6.2.2.)
We allow the additions in these inner products to be performed in any order.

The lower bounds on I/O time derived below for the I/O-limited MHG apply to all DAGs
for matrix multiplication. Since these DAGs include graphs other than the inner product trees
in $\mathcal{F}_n$, one might expect the lower bounds for the I/O-limited case to be smaller than those
derived for graphs in $\mathcal{F}_n$. However, this is not the case, apparently because efficient pebbling
strategies for matrix multiplication perform I/O operations only on input and output vertices,
not on internal vertices. The situation is very different for the discrete Fourier transform, as
seen in the next section.

We derive results first for the red-blue pebble game, that is, the two-level MHG, and then
generalize them to the multi-level MHG. We begin by deriving an upper bound on the S-span
for the family of inner product matrix multiplication graphs.

LEMMA 11.5.1 For every graph $G \in \mathcal{F}_n$ the S-span $\rho(S, G)$ satisfies the bound
$\rho(S, G) \le 2S^{3/2}$ for $S \le n^2$.

Proof $\rho(S, G)$ is the maximum number of vertices of $G \in \mathcal{F}_n$ that can be pebbled with
S red pebbles from an initial placement of these pebbles, maximized over all such initial
placements. Let A, B, and C be $n \times n$ matrices with entries $\{a_{i,j}\}$, $\{b_{i,j}\}$, and $\{c_{i,j}\}$,
respectively, where $1 \le i, j \le n$. Let $C = A \times B$. The term $c_{i,j} = \sum_k a_{i,k} b_{k,j}$ is
associated with the root vertex of a unique inner product tree. Vertices in this tree are
either addition vertices, product vertices associated with terms of the form $a_{i,k} b_{k,j}$, or input
vertices associated with entries in the matrices A and B. Each product term $a_{i,k} b_{k,j}$ is
associated with a unique term $c_{i,j}$ and tree, as is each addition operator.

Consider an initial placement of $S \le n^2$ pebbles of which r are in addition trees (they
are on addition or product vertices). Let the remaining $S - r$ pebbles reside on input
vertices. Let p be the number of product vertices that can be pebbled from these pebbled
inputs. We show that at most $p + r - 1$ additional pebble placements are possible from the
initial placement, giving a total of at most $\pi = 2p + r - 1$ pebble placements. (Figure 11.5
shows a graph G for a $2 \times 2$ matrix multiplication algorithm in which the product vertices
are those just below the output vertices. The black vertices carry pebbles. In this example
$r = 2$ and $p = 1$. While $p + r - 1 = 2$, only one pebble placement is possible on addition
trees in this example.)

Figure 11.5 Graph of the inner products used to form the product of two $2 \times 2$ matrices.
(Common input vertices are repeated for clarity.)

Given the dependencies of graphs in $\mathcal{F}_n$, there is no loss in generality in assuming that
product vertices are pebbled before pebbles are advanced in addition trees. It follows that at
most $p + r$ addition-tree vertices carry pebbles before pebbles are advanced in addition trees.
These pebbled vertices define subtrees of vertices that can be pebbled from the $p + r$ initial
pebble placements. Since a binary tree with n leaves has $n - 1$ non-leaf nodes, it follows
that if there are t such trees, at most $p + r - t$ pebble placements will be made, not counting
the original placement of pebbles. This number is maximized at $t = 1$. (See Problem 11.9.)

We now complete the proof by deriving an upper bound on p. Let $\bar{A}$ be the 0-1 $n \times n$
matrix whose $(i, j)$ entry is 1 if the variable in the $(i, j)$ position of the matrix A carries a
pebble initially and 0 otherwise. Let $\bar{B}$ be similarly defined for B. It follows that the $(i, j)$
entry, $\delta_{i,j}$, of the matrix product $\bar{C} = \bar{A} \times \bar{B}$, where addition and multiplication are over
the integers, is equal to the number of products that can be formed that contribute to the
$(i, j)$ entry of the result matrix C. Thus $p = \sum_{i,j} \delta_{i,j}$. We now show that $p \le \sqrt{S}(S - r)$.

Let $\bar{A}$ and $\bar{B}$ have a and b 1's, respectively, where $a + b = S - r$. There are at most $a/\alpha$
rows of $\bar{A}$ containing at least $\alpha$ 1's. The maximum number of products that can be formed
from such rows is $ab/\alpha$ because each 1 in $\bar{B}$ can combine with a 1 in each of these rows. Now
consider the product of other rows of $\bar{A}$ with columns of $\bar{B}$. At most S such row-column
inner products are formed since at most S outputs can be pebbled. Since each of them
involves a row with at most $\alpha$ 1's, at most $\alpha S$ products of pairs of variables can be formed.
Thus, a total of at most $ab/\alpha + \alpha S$ products can be formed, that is, $p \le ab/\alpha + \alpha S$. We
are free to choose $\alpha$ to minimize this sum ($\alpha = \sqrt{ab/S}$ does this) but must choose a and b
to maximize it ($a = (S - r)/2$ satisfies this requirement). The result is that
$p \le \sqrt{S}(S - r)$. We complete the proof by observing that
$\pi = 2p + r - 1 \le 2S\sqrt{S}$ for $r \ge 0$. ■
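
The optimization that closes this proof can be written out explicitly. The following is a
routine calculus sketch added for convenience; it writes $\alpha$ for the row threshold used in
the proof, and the name $g(\alpha)$ is introduced here only for this calculation.

```latex
\begin{align*}
g(\alpha) &= \frac{ab}{\alpha} + \alpha S, \qquad
g'(\alpha) = -\frac{ab}{\alpha^{2}} + S = 0
\;\Longrightarrow\; \alpha = \sqrt{ab/S}, \qquad
g\bigl(\sqrt{ab/S}\bigr) = 2\sqrt{abS}, \\
a + b = S - r \;&\Longrightarrow\; ab \le \Bigl(\frac{S-r}{2}\Bigr)^{2}
\;\Longrightarrow\; p \le 2\sqrt{abS} \le \sqrt{S}\,(S - r), \\
\pi = 2p + r - 1 &\le 2\sqrt{S}\,(S - r) + r - 1
 = 2S^{3/2} - r\bigl(2\sqrt{S} - 1\bigr) - 1 \le 2S^{3/2}
 \quad \text{for } S \ge 1,\ r \ge 0 .
\end{align*}
```

The last line shows why the bound does not depend on r: every pebble moved from the inputs
into the addition trees removes more potential products than it adds placements.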


Theorem 11.5.2 states bounds that apply to the computation and I/O time in the red-blue
pebble game for matrix multiplication.


THEOREM 11.5.2 For every graph G in the family $\mathcal{F}_n$ of inner product graphs for multiplying
two $n \times n$ matrices and for every pebbling strategy $\mathcal{P}$ for G in the red-blue pebble game that
uses $S \ge 3$ red pebbles, the computation and I/O time satisfy the following lower bounds:

$$T_1^{(2)}(S, G, \mathcal{P}) = \Omega(n^3)$$

$$T_2^{(2)}(S, G, \mathcal{P}) = \Omega\left(\frac{n^3}{\sqrt{S}}\right)$$

Furthermore, there is a pebbling strategy $\mathcal{P}$ for G with $S \ge 3$ red pebbles such that the
following upper bounds hold simultaneously:

$$T_1^{(2)}(S, G, \mathcal{P}) = O(n^3)$$

$$T_2^{(2)}(S, G, \mathcal{P}) = O\left(\frac{n^3}{\sqrt{S}}\right)$$

The lower bound on I/O time stated above applies for every graph of a straight-line program for
matrix multiplication in the I/O-limited red-blue pebble game. The upper bound on I/O time
also applies for this game. The computation time in the I/O-limited red-blue pebble game satisfies
the following bound:

$$T_1^{(2)}(S, G, \mathcal{P}) = \Omega\left(\frac{n^3}{\sqrt{S}}\right)$$

Proof For the standard MHG, the lower bound to $T_1^{(2)}(S, G, \mathcal{P})$ follows from the fact that
every graph in $\mathcal{F}_n$ has $\Theta(n^3)$ vertices and Lemma 11.3.1. The lower bound to
$T_2^{(2)}(S, G, \mathcal{P})$ follows from Corollary 11.4.1 and Lemma 11.5.1 and the lower bound to
$T_2^{(2)}(S, G, \mathcal{P})$ for the I/O-limited MHG follows from Theorem 11.3.1.

We now describe a pebbling strategy that has the I/O time stated above and uses the
obvious algorithm suggested by Fig. 11.6. If S red pebbles are available, let $r = \lfloor \sqrt{S/3} \rfloor$ be
an integer that divides n. (If r does not divide n, embed A, B and C in larger matrices for
which r does divide n. This requires at most doubling n.) Let the $n \times n$ matrices A, B and
C be partitioned into $n/r \times n/r$ matrices; that is, $A = [a_{i,j}]$, $B = [b_{i,j}]$, and $C = [c_{i,j}]$,
whose entries are $r \times r$ matrices. We form the $r \times r$ submatrix $c_{i,j}$ of C as the inner product
of a row of $r \times r$ submatrices of A with a column of such submatrices of B:

$$c_{i,j} = \sum_{q=1}^{n/r} a_{i,q} \times b_{q,j}$$

We begin by placing blue pebbles on each entry in matrices A and B. Compute $c_{i,j}$ by
computing $a_{i,q} \times b_{q,j}$ for $q = 1, 2, \ldots, n/r$ and adding successive products to the running
sum. Keep $r^2$ red pebbles on the running sum. Compute $a_{i,q} \times b_{q,j}$ by placing and holding
$r^2$ red pebbles on the entries in $a_{i,q}$ and r red pebbles on one column of $b_{q,j}$ at a time. Use
two additional red pebbles to compute the $r^2$ inner products associated with entries of $c_{i,j}$
in the fashion suggested by Fig. 11.4 if $r \ge 2$ and one additional pebble if $r = 1$. The
maximum number of red pebbles in use is 3 if $r = 1$ and at most $2r^2 + r + 2$ if $r \ge 2$.
Since $2r^2 + r + 2 \le 3r^2$ for $r \ge 2$, in both cases at most $3r^2$ red pebbles are needed. Thus,
there are enough red pebbles to play this game because $r = \lfloor \sqrt{S/3} \rfloor$ implies that $3r^2 \le S$,
the number of red pebbles. Since $r \ge 1$, this requires that $S \ge 3$.


Figure 11.6 A pebbling schema for matrix multiplication based on the representation of a
matrix in terms of block submatrices. A submatrix of C is computed as the inner product of a
row of blocks of A with a column of blocks of B.

This algorithm performs one input operation on each entry of $a_{i,q}$ and $b_{q,j}$ to compute
$c_{i,j}$. It also performs one output operation per entry to compute $c_{i,j}$ itself. Summing over
all values of i and j, we find that $n^2$ output operations are performed on entries in C. Since
there are $(n/r)^2$ submatrices $a_{i,q}$ and $b_{q,j}$ and each is used to compute $n/r$ terms $c_{u,v}$, the
number of input operations on entries in A and B is $2(n/r)^2 r^2 (n/r) = 2n^3/r$. Because
$r = \lfloor \sqrt{S/3} \rfloor$, we have $r > \sqrt{S/3} - 1$, from which the upper bound on the number of
I/O operations follows. Since each product and addition vertex in each inner product graph
is pebbled once, $O(n^3)$ computation steps are performed.
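
The input and output counts of this blocked strategy can be reproduced with a small
simulation. The sketch below is an illustration only: the function name `blocked_matmul_io`
is hypothetical, matrices are plain Python lists, and the counters stand in for I/O operations
rather than modelling the pebble game itself. It assumes that $r = \lfloor \sqrt{S/3} \rfloor$
divides n and raises an error otherwise instead of embedding the matrices in larger ones.

```python
import math


def blocked_matmul_io(A, B, S):
    """Multiply square matrices block by block, counting simulated I/O operations.

    Returns (C, inputs, outputs, r), where r = floor(sqrt(S / 3)) is the block size.
    """
    n = len(A)
    r = math.isqrt(S // 3)               # equals floor(sqrt(S / 3)) for integer S
    if r < 1 or n % r != 0:
        raise ValueError("need S >= 3 and a block size r that divides n")
    blocks = n // r
    C = [[0] * n for _ in range(n)]
    inputs = outputs = 0
    for bi in range(blocks):
        for bj in range(blocks):
            # Running sum for block c_{bi,bj}: held on r*r red pebbles.
            acc = [[0] * r for _ in range(r)]
            for q in range(blocks):
                # Read block a_{bi,q} once: r*r input operations.
                a = [row[q * r:(q + 1) * r] for row in A[bi * r:(bi + 1) * r]]
                inputs += r * r
                for v in range(r):
                    # Read one column of b_{q,bj}: r input operations.
                    col = [B[q * r + k][bj * r + v] for k in range(r)]
                    inputs += r
                    for u in range(r):
                        acc[u][v] += sum(a[u][k] * col[k] for k in range(r))
            for u in range(r):
                for v in range(r):
                    C[bi * r + u][bj * r + v] = acc[u][v]
                    outputs += 1         # one output operation per entry of C
    return C, inputs, outputs, r
```

For example, with n = 4 and S = 12 the block size is r = 2, and the counters give
$2n^3/r = 64$ input operations and $n^2 = 16$ output operations.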

The bound on $T_1^{(2)}(S, G, \mathcal{P})$ for the I/O-limited game follows from two observations.
First, the computational inequality of Theorem 10.4.1 provides a lower bound to $T_I$, the
number of times that input vertices are pebbled in the red-pebble game when only red
pebbles are used on vertices. This is the I/O-limited model. Second, the lower bound of
Theorem 10.5.4 on T (actually, $T_I$) is of the form desired. ■


These results and the strategy given for the two-level case carry over to the multi-level case,
although considerable care is needed to ensure that the pebbling strategy does not fragment
memory and lead to inefficient upper bounds.

Even  though  the  pebbling  strategy  given  below  is  an  I/O-limited  strategy,  it  provides 
bounds  on  time  in  terms  of  space  that  match  the  lower  bounds  for  the  standard  MHG. 

THEOREM 11.5.3 For every graph G in the family $\mathcal{F}_n$ of inner product graphs for multiplying
two $n \times n$ matrices and for every pebbling strategy $\mathcal{P}$ for G in the standard MHG with resource
vector p that uses $p_1 \ge 3$ first-level pebbles, the computation and I/O time satisfy the following
lower bounds, where $s_l = \sum_{j=1}^{l} p_j$ and k is the largest integer such that $s_k \le 3n^2$:

$$T_1^{(L)}(p, G, \mathcal{P}) = \Omega(n^3)$$

$$T_l^{(L)}(p, G, \mathcal{P}) =
\begin{cases}
\Omega\left(n^3/\sqrt{s_{l-1}}\right) & \text{for } 2 \le l \le k \\
\Omega(n^2) & \text{for } k + 1 \le l \le L
\end{cases}$$

Furthermore, there is a pebbling strategy $\mathcal{P}$ for G with $p_1 \ge 3$ such that the following upper
bounds hold simultaneously:

$$T_1^{(L)}(p, G, \mathcal{P}) = O(n^3)$$

$$T_l^{(L)}(p, G, \mathcal{P}) =
\begin{cases}
O\left(n^3/\sqrt{s_{l-1}}\right) & \text{for } 2 \le l \le k \\
O(n^2) & \text{for } k + 1 \le l \le L
\end{cases}$$

In the I/O-limited MHG the upper bounds given above apply. The following lower bound on the
I/O time applies to every graph G for $n \times n$ matrix multiplication and every pebbling strategy
$\mathcal{P}$, where $S = s_{l-1}$:

$$T_l^{(L)}(p, G, \mathcal{P}) = \Omega\left(n^3/\sqrt{S}\right) \quad \text{for } 1 \le l \le L$$

Proof The lower bounds on $T_l^{(L)}(p, G, \mathcal{P})$, $2 \le l \le L$, follow from Theorems 11.3.1 and
11.5.2. The lower bound on $T_1^{(L)}(p, G, \mathcal{P})$ follows from the fact that every graph in $\mathcal{F}_n$
has $\Theta(n^3)$ vertices to be pebbled.

Figure 11.7 A three-level decomposition of a matrix. (The figure is annotated with the block
size $r_2 = r_1 \lfloor \sqrt{s_2 - 1}/(\sqrt{3}\, r_1) \rfloor$.)


We now describe a multi-level recursive pebbling strategy satisfying the upper bounds
given above. It is based on the two-level strategy given in the proof of Theorem 11.5.2. We
compute C from A and B using inner products.

Our approach is to successively block A, B, and C into $r_i \times r_i$ submatrices for
$i = k, k-1, \ldots, 1$ where the $r_i$ are chosen, as suggested in Fig. 11.7, so they divide one another
and avoid memory fragmentation. They are also chosen relative to $s_i$ so that enough
pebbles are available to pebble $r_i \times r_i$ submatrices, as explained below.

$$r_i =
\begin{cases}
\left\lfloor \sqrt{s_1/3} \right\rfloor & i = 1 \\
r_{i-1} \left\lfloor \sqrt{s_i - i + 1}\,/\,(\sqrt{3}\, r_{i-1}) \right\rfloor & i \ge 2
\end{cases}$$

Using the fact that $b/2 \le a \lfloor b/a \rfloor \le b$ for integers a and b satisfying $1 \le a \le b$ (see
Problem 11.1), we see that $\sqrt{(s_i - i + 1)/12} \le r_i \le \sqrt{(s_i - i + 1)/3}$. Thus,
$s_i \ge 3r_i^2 + i - 1$. Also, $r_k^2 \le n^2$ because $s_k \le 3n^2$.
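
To make the choice of block sizes concrete, the sketch below computes $r_1, r_2, \ldots$ from
the cumulative pebble counts $s_1, s_2, \ldots$ using integer arithmetic only. The function
name `block_sizes` is hypothetical. It relies on the identity
$\lfloor \sqrt{m/3}/r \rfloor = \lfloor \sqrt{\lfloor m/(3r^2) \rfloor} \rfloor$ for positive
integers m and r, which avoids floating-point rounding, and it assumes $s_1 \ge 3$ and at
least one pebble at each higher level.

```python
import math


def block_sizes(s):
    """Return [r_1, ..., r_k] for cumulative pebble counts s = [s_1, ..., s_k].

    r_1 = floor(sqrt(s_1 / 3)) and, for i >= 2,
    r_i = r_{i-1} * floor(sqrt(s_i - i + 1) / (sqrt(3) * r_{i-1})).
    """
    if s[0] < 3:
        raise ValueError("need s_1 >= 3")
    r = []
    for i, s_i in enumerate(s, start=1):
        m = s_i - i + 1                  # pebbles left after setting aside the reserve
        prev = r[-1] if r else 1         # with prev = 1 the formula reduces to r_1
        r.append(prev * math.isqrt(m // (3 * prev * prev)))
    return r


sizes = block_sizes([3, 30, 400])
print(sizes)  # [1, 3, 9]
```

In this example each block size divides the next, and $s_i \ge 3r_i^2 + i - 1$ holds at every
level (3 ≥ 3, 30 ≥ 28, 400 ≥ 245), so three $r_i \times r_i$ blocks fit at level i without
touching the reserve pebbles.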

By definition, $s_l$ pebbles are available at level l and below. As stated earlier, there is at
least one pebble at each level above the first. From the $s_l$ pebbles at level l and below we
create a reserve set containing one pebble at each level except the first. This reserve set is
used to perform I/O operations as needed.

Without loss of generality, assume that $r_k$ divides n. (If not, n must be at most doubled
for this to be true. Embed A, B, and C in such larger matrices.) A, B, and C are then
blocked into $r_k \times r_k$ submatrices (call them $a_{i,j}$, $b_{i,j}$, and $c_{i,j}$), and these in turn are blocked
into $r_{k-1} \times r_{k-1}$ submatrices, continuing until $1 \times 1$ submatrices are reached. The submatrix
$c_{i,j}$ is defined as

$$c_{i,j} = \sum_{q=1}^{n/r_k} a_{i,q} \times b_{q,j}$$

As in Theorem 11.5.2, $c_{i,j}$ is computed as a running sum, as suggested in Fig. 11.4,
where each vertex is associated with an $r_k \times r_k$ submatrix. It follows that $3r_k^2$ pebbles at
level k or less (not including the reserve pebbles) suffice to hold pebbles on submatrices
$a_{i,q}$, $b_{q,j}$ and the running sum. To compute a product $a_{i,q} \times b_{q,j}$, we represent $a_{i,q}$ and
$b_{q,j}$ as block matrices with blocks that are $r_{k-1} \times r_{k-1}$ matrices. Again, we form this product as
suggested in Fig. 11.4, using $3r_{k-1}^2$ pebbles at levels $k - 1$ or lower. This process is repeated
until we encounter a product of $r_1 \times r_1$ matrices, which is then pebbled according to the
procedure given in the proof of Theorem 11.5.2.

Let's now determine the number of I/O and computation steps at each level. Since all
non-input vertices of G are pebbled once, the number of computation steps is $O(n^3)$. I/O
operations are done only on input and output vertices. Once an output vertex has been
pebbled at the first level, reserve pebbles can be used to place a level-L pebble on it. Thus
one output is done on each of the $n^2$ output vertices at each level.

We now count the I/O operations on input vertices starting with level k. The $n \times n$ matrices
A, B, and C contain $r_k \times r_k$ matrices, where $r_k$ divides n. Each of the $(n/r_k)^2$ submatrices
$a_{i,q}$ and $b_{q,j}$ is used in $(n/r_k)$ inner products and at most $r_k^2$ I/O operations at level k are
performed on them. (If most of the $s_k$ pebbles at level k or less are at lower levels, fewer
level-k I/O operations will be performed.) Thus, at most $2(n/r_k)^2 (n/r_k) r_k^2 = 2n^3/r_k$
I/O operations are performed at level k. In turn, each of the $r_k \times r_k$ matrices contains
$(r_k/r_{k-1})^2$ $r_{k-1} \times r_{k-1}$ matrices; each of these is involved in $(r_k/r_{k-1})$ inner products
each of which requires at most $r_{k-1}^2$ I/O operations. Since there are at most $(n/r_{k-1})^2$
$r_{k-1} \times r_{k-1}$ submatrices in each of A, B, and C, at most $2n^3/r_{k-1}$ I/O operations are
performed at level $k - 1$. Continuing in this fashion, at most $2n^3/r_l$ I/O operations are
performed at level l for $2 \le l \le k$. Since $r_l \ge \sqrt{(s_l - l + 1)/12}$, we have the desired
conclusion.

Since  the  above  pebbling  strategy  does  not  place  pebbles  at  level  2  or  above  on  any  vertex 
except  input  and  output  vertices,  it  applies  in  the  I/O-limited  case.  The  lower  bound  follows 
from  Lemma  11.3.1  and  Theorem  11.5.2.  ■ 

11.5.3  The  Fast  Fourier  Transform 

The fast Fourier transform (FFT) algorithm is described in Section 6.7.3 (an FFT graph is
given in Fig. 11.1). A lower bound is obtained by the Hong-Kung method for the FFT by
deriving an upper bound on the S-span of the FFT graph. In this section all logarithms have
base 2.

LEMMA 11.5.2 The S-span of the FFT graph $F^{(d)}$ on $n = 2^d$ inputs satisfies
$\rho(S, G) \le 2S \log S$ when $S \le n$.

Proof $\rho(S, G)$ is the maximum number of vertices of G that can be pebbled with S red
pebbles from an initial placement of these pebbles, maximized over all such initial
placements. G contains many two-input FFT (butterfly) graphs, as shown in Fig. 11.8. If $v_1$
and $v_2$ are the output vertices in such a two-input FFT and if one of them is pebbled, we
obtain an upper bound on the number of pebbled vertices if we assume that both of them
are pebbled. In this proof we let $\{p_i \mid 1 \le i \le S\}$ denote the S pebbles available to pebble
G. We assign an integer cost $num(p_i)$ (initialized to zero) to the ith pebble $p_i$ in order to
derive an upper bound to the total number of pebble placements made on G.

Figure 11.8 A two-input butterfly graph with pebbles $p_1$ and $p_2$ resident on inputs.

Consider a matching pair of output vertices $v_1$ and $v_2$ of a two-input butterfly graph
and their common predecessors $u_1$ and $u_2$, as suggested in Fig. 11.8. Suppose that on the
next step we can place a pebble on $v_1$. Then pebbles (call them $p_1$ and $p_2$) must reside on
$u_1$ and $u_2$. Advance $p_1$ and $p_2$ to both $v_1$ and $v_2$. (Although the rules stipulate that an
additional pebble is needed to advance the two pebbles, violating this restriction by allowing
their movement to $v_1$ and $v_2$ can only increase the number of possible moves, a useful effect
since we are deriving an upper bound on the number of pebble placements.)

After advancing $p_1$ and $p_2$, if $num(p_1) = num(p_2)$, augment both by 1; otherwise,
augment the smaller by 1. Since the predecessors of two vertices in an FFT graph are in
disjoint trees, there is no loss in assuming that all S pebbles remain on the graph in a
pebbling that maximizes the number of pebbled vertices. Because two pebble placements
are possible each time $num(p_i)$ increases by 1 for some i,
$\rho(S, G) \le 2 \sum_{1 \le i \le S} num(p_i)$.

We now show that the number of vertices that contained pebbles initially and are
connected via paths to the vertex covered by $p_i$ is at least $2^{num(p_i)}$. That is, $2^{num(p_i)} \le S$
or $num(p_i) \le \log_2 S$, from which the upper bound on $\rho(S, G)$ follows. Our proof is by
induction. For the base case of $num(p_i) = 1$, two pebbles must reside on the two immediate
predecessors of a vertex containing the pebble $p_i$. Assume that the hypothesis holds for
$num(p_i) \le e - 1$. We show that it holds for $num(p_i) = e$. Consider the first point in
time that $num(p_i) = e$. At this time $p_i$ and a second pebble $p_j$ reside on a matching pair
of vertices, $v_1$ and $v_2$. Before these pebbles are advanced to these two vertices from $u_1$ and
$u_2$, the immediate predecessors of $v_1$ and $v_2$, the smaller of $num(p_i)$ and $num(p_j)$ has a
value of $e - 1$. This must be $p_i$ because its value has increased. Thus, each of $u_1$ and $u_2$
has at least $2^{e-1}$ predecessors that contained pebbles initially. Because the predecessors of $u_1$
and $u_2$ are disjoint, each of $v_1$ and $v_2$ has at least $2^e = 2^{num(p_i)}$ predecessors that carried
pebbles initially. ■

This upper bound on the S-span is combined with Theorem 11.4.1 to derive a lower
bound on the I/O time at level l to pebble the FFT graph. We derive upper bounds that match
to within a multiplicative constant when the FFT graph is pebbled in the standard MHG. We
develop bounds for the red-blue pebble game and then generalize them to the MHG.

THEOREM 11.5.4 Let the FFT graph on $n = 2^d$ inputs, $F^{(d)}$, be pebbled in the red-blue
pebble game with S red pebbles. When $S \ge 3$ there is a pebbling of $F^{(d)}$ such that the following
bounds hold simultaneously, where $T_1^{(2)}(S, F^{(d)})$ and $T_2^{(2)}(S, F^{(d)})$ are the computation and
I/O time in a minimal pebbling of $F^{(d)}$:

$$T_1^{(2)}(S, F^{(d)}) = \Theta(n \log n)$$

$$T_2^{(2)}(S, F^{(d)}) = \Theta\left(\frac{n \log n}{\log S}\right)$$

Proof The lower bound on $T_1^{(2)}(S, F^{(d)})$ is obvious; every vertex in $F^{(d)}$ must be
pebbled a first time. The lower bound on $T_2^{(2)}(S, F^{(d)})$ follows from Corollary 11.4.1,
Theorem 11.3.1, Lemma 11.5.2, and the obvious lower bound on $|V|$. We now exhibit a pebbling
strategy giving upper bounds that match the lower bounds up to a multiplicative factor.

As shown in Corollary 6.7.1, $F^{(d)}$ can be decomposed into $\lceil d/e \rceil$ stages, $\lfloor d/e \rfloor$ stages
containing $2^{d-e}$ copies of $F^{(e)}$ and one stage containing $2^{d-k}$ copies of $F^{(k)}$,
$k = d - \lfloor d/e \rfloor e$. (See Fig. 11.9.) The output vertices of one stage are the input vertices to the
next. For example, $F^{(12)}$ can be decomposed into three stages with $2^{12-4} = 256$ copies of
$F^{(4)}$ on each stage and one stage with $2^{12}$ copies of $F^{(0)}$, a single vertex. (See Fig. 11.10.)
We use this decomposition and the observation that $F^{(e)}$ can be pebbled level by level with
$2^e + 1$ level-1 pebbles without repebbling any vertex to develop our pebbling strategy for
$F^{(d)}$.

Given S red pebbles, our pebbling strategy is based on this decomposition with
$e = d_0 = \lfloor \log_2(S - 1) \rfloor$. Since $S \ge 3$, $d_0 \ge 1$. Of the S red pebbles, we actually use only
$S_0 = 2^{d_0} + 1$. Since $S_0 \le S$, the number of I/O operations with $S_0$ red pebbles is no
less than with S red pebbles. Let $d_1 = \lfloor d/d_0 \rfloor$. Then, $F^{(d)}$ is decomposed into $d_1$ stages
each containing $2^{d-d_0}$ copies of $F^{(d_0)}$ and one stage containing $2^{d-t}$ copies of $F^{(t)}$ where
$t = d - d_0 d_1$. Since $t \le d_0$, each vertex in $F^{(t)}$ can be pebbled with $S_0$ pebbles without
re-pebbling vertices. The same applies to $F^{(d_0)}$.

Figure 11.9 Decomposition of the FFT graph $F^{(d)}$ into $\beta = 2^e$ bottom FFT graphs $F^{(d-e)}$
and $\tau = 2^{d-e}$ top $F^{(e)}$. Edges between bottom and top sub-FFT graphs identify common
vertices between the two.

Figure 11.10 The decomposition of an FFT graph into three stages each containing 256
copies of $F^{(4)}$. The gray areas identify rows of $F^{(12)}$ in which inputs to one copy of $F^{(4)}$ are
outputs of copies of $F^{(4)}$ at the preceding level.

The pebbling strategy for the red-blue pebble game is based on this decomposition.
Pebbles are advanced to outputs of each of the bottom FFT subgraphs $F^{(t)}$ using
$2^t + 1 \le S_0$ red pebbles, after which the red pebbles are replaced with blue pebbles. The
subgraphs $F^{(d_0)}$ in each of the succeeding stages are then pebbled in the same fashion; that is,
their blue-pebbled inputs are replaced with red pebbles and red pebbles are advanced to their
outputs after which they are replaced with blue pebbles.

This strategy pebbles each vertex once with red pebbles with the exception of vertices
common to two FFT subgraphs which are pebbled twice. It follows that
$T_1^{(2)}(S, F^{(d)}) \le 2^{d+1}(d + 1) = 2n(\log_2 n + 1)$. This strategy also executes one I/O
operation for each of the $2^d$ inputs and outputs to $F^{(d)}$ and two I/O operations for each of the
$2^d$ vertices common to adjacent stages. Since there are $\lceil d/d_0 \rceil$ stages, there are
$\lceil d/d_0 \rceil - 1$ such pairs of stages. Thus, the number of I/O operations satisfies
$T_2^{(2)}(S, F^{(d)}) \le 2^{d+1} \lceil d/d_0 \rceil \le 2n(\log_2 n / \log_2(S/4) + 1) = O(n \log n / \log S)$. ■
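
The I/O count of this staged strategy is easy to evaluate for concrete values. The sketch
below follows the counting argument above; the function name `fft_red_blue_io` is
hypothetical, and the code only tallies operations rather than pebbling an actual FFT graph.

```python
def fft_red_blue_io(d, S):
    """Return (d0, stages, io_ops) for the staged pebbling of F^(d) with S red pebbles.

    d0     -- floor(log2(S - 1)), the depth of the sub-FFT graphs in each full stage
    stages -- ceil(d / d0), the number of stages
    io_ops -- the total number of I/O operations counted by the strategy
    """
    if S < 3 or d < 1:
        raise ValueError("need S >= 3 and d >= 1")
    d0 = (S - 1).bit_length() - 1        # floor(log2(S - 1)) for S - 1 >= 1
    stages = -(-d // d0)                 # ceiling division
    n = 1 << d
    # n inputs and n outputs, plus two I/O operations for each of the n vertices
    # shared by each of the (stages - 1) adjacent pairs of stages.
    io_ops = 2 * n + 2 * n * (stages - 1)
    return d0, stages, io_ops


print(fft_red_blue_io(12, 17))  # (4, 3, 24576)
```

With d = 12 and S = 17 this matches the example in the text: $d_0 = 4$, three stages, and
$2^{d+1} \lceil d/d_0 \rceil = 24576$ I/O operations.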

The  bounds  for  the  multi-level  case  generalize  those  for  the  red-blue  pebble  game.  As  with 
matrix  multiplication,  care  must  be  taken  to  avoid  memory  fragmentation. 


THEOREM 11.5.5 Let the FFT graph on $n = 2^d$ inputs, $F^{(d)}$, be pebbled in the standard MHG with resource vector $p$. Let $s_l = \sum_{j=1}^{l} p_j$ and let $k$ be the largest integer such that $s_k \le n$. When $p_1 \ge 3$, the following lower bounds hold for all pebblings of $F^{(d)}$ and there exists a pebbling $\mathcal{P}$ for which the upper bounds are simultaneously satisfied:

$$T_l^{(L)}(p, F^{(d)}, \mathcal{P}) = \begin{cases} \Theta(n \log n) & l = 1 \\ \Theta\!\left(\dfrac{n \log n}{\log s_{l-1}}\right) & 2 \le l \le k \\ \Theta(n) & k + 1 \le l \le L \end{cases}$$

Proof Proofs of the first two lower bounds follow from Lemma 11.3.1 and Theorem 11.5.4. The third follows from the fact that pebbles at every level must be placed on each input and output vertex but no intermediate vertex. We now exhibit a pebbling strategy giving upper bounds that match (up to a multiplicative factor) these lower bounds for all $1 \le l \le L$. (See Fig. 11.9.)

We define a non-decreasing sequence $d = (d_0, d_1, d_2, \ldots, d_{L-1})$ of integers used below to describe an efficient multi-level pebbling strategy for $F^{(d)}$. Let $d_0 = 1$ and $d_1 = \lfloor \log(s_1 - 1) \rfloor \ge 1$, where $s_1 = p_1 \ge 3$. Define $m_r$ and $d_r$ for $2 \le r \le L - 1$ by

$$m_r = \left\lfloor \frac{\lfloor \log \min(s_r - 1, n) \rfloor}{d_{r-1}} \right\rfloor$$

$$d_r = m_r d_{r-1}$$

It follows that $s_r \ge 2^{d_r} + 1$ when $s_r \le n + 1$ since $a \lfloor b/a \rfloor \le b$. Because $\lfloor \log a \rfloor \ge (\log a)/2$ when $a \ge 1$ and also $a \lfloor b/a \rfloor \ge b/2$ for integers $a$ and $b$ when $1 \le a \le b$ (see Problem 11.1), it follows that $d_r \ge \log(\min(s_r - 1, n))/4$. The values $d_l$ are chosen to avoid memory fragmentation.
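The construction of the depths $d_r$ can be tried out numerically. The following is a minimal sketch, assuming that logarithms are base 2 and that the resource vector is given as a Python list; the function names `floor_log2` and `depth_sequence` and the sample resource vector are illustrative assumptions, not part of the text.

```python
def floor_log2(x: int) -> int:
    """Exact floor(log2(x)) for a positive integer x."""
    return x.bit_length() - 1


def depth_sequence(p, n):
    """Return [d_0, d_1, ..., d_{L-1}] for a resource vector p = (p_1, ..., p_{L-1}).

    Assumes base-2 logarithms and p_1 >= 3, as in the definition above.
    """
    if p[0] < 3:
        raise ValueError("the construction assumes p_1 >= 3")
    s, total = [], 0
    for pj in p:                      # s[r-1] holds s_r = p_1 + ... + p_r
        total += pj
        s.append(total)
    d = [1, floor_log2(s[0] - 1)]     # d_0 = 1, d_1 = floor(log(s_1 - 1))
    for r in range(2, len(p) + 1):
        m_r = floor_log2(min(s[r - 1] - 1, n)) // d[r - 1]
        d.append(m_r * d[r - 1])      # d_r is a multiple of d_{r-1}
    return d


# Hypothetical three-level resource vector and n = 2^20 inputs.
depths = depth_sequence([5, 60, 1000], n=2**20)
print(depths)  # [1, 2, 6, 6]
```

Because each $d_r$ is a multiple of $d_{r-1}$, a subgraph of depth $d_r$ splits into a whole number of stages of depth $d_{r-1}$, which is one way to read the remark about avoiding memory fragmentation.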

Before  describing  our  pebbling  strategy,  note  that  because  we  assume  at  least  one  pebble 
is  available  at  each  level  in  the  hierarchy,  it  is  possible  to  perform  an  I/O  operation  at  each 
level.  Also,  pebbles  at  levels  less  than  l  can  be  used  as  though  they  were  at  level  l. 

Our pebbling strategy is based on the decomposition of $F^{(d)}$ into FFT subgraphs $F^{(d_k)}$, each of which is decomposed into FFT subgraphs $F^{(d_{k-1})}$, and so on, until reaching FFT subgraphs $F^{(1)}$ that are two-input, two-output butterfly graphs. To pebble $F^{(d)}$ we apply the strategy described in the proof of Theorem 11.5.4 as follows. We decompose $F^{(d_2)}$ into $d_2/d_1$ stages, each containing $2^{d_2-d_1}$ copies of $F^{(d_1)}$, which we pebble with $s_1 = p_1$ first-level pebbles using this strategy. By the analysis in the proof of Theorem 11.5.4, $2^{d_2+1}$ level-2 I/O operations are performed on inputs and outputs to $F^{(d_2)}$ as well as another $2^{d_2+1}$ level-2 I/O operations on the vertices between two stages. Since there are $d_2/d_1$ stages, a total of $(d_2/d_1)2^{d_2+1}$ level-2 I/O operations are performed. We then decompose $F^{(d_3)}$ into $d_3/d_2$ stages each containing $2^{d_3-d_2}$ copies of $F^{(d_2)}$. We pebble $F^{(d_3)}$ with $s_2$ pebbles at level 1 or 2 by pebbling copies of $F^{(d_2)}$ in stages, using $(d_3/d_2)2^{d_3+1}$ level-3 I/O operations and using $(d_3/d_2)2^{d_3-d_2}$ times as many level-2 I/O operations as used by $F^{(d_2)}$. Let $n_2^{(3)}$ be the number of level-2 I/O operations used to pebble $F^{(d_3)}$. Then $n_2^{(3)} = (d_3/d_1)2^{d_3+1}$.

Continuing in this fashion, we pebble $F^{(d_r)}$, $1 \le r \le k$, with $s_{r-1}$ pebbles at level $r - 1$ or below by pebbling copies of $F^{(d_{r-1})}$ in stages, using $(d_r/d_{r-1})2^{d_r+1}$ level-$r$ I/O operations and using $(d_r/d_{r-1})2^{d_r-d_{r-1}}$ times as many level-$j$ I/O operations as used by $F^{(d_{r-1})}$ for $1 \le j \le r - 1$. Let $n_j^{(r)}$ be the number of level-$j$ I/O operations used to pebble $F^{(d_r)}$. By induction it follows that $n_j^{(r)} = (d_r/d_{j-1})2^{d_r+1}$.
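The induction can be checked mechanically. The sketch below builds the level-$j$ operation counts stage by stage for $j \ge 2$ and compares them with the closed form, reading the closed form as $(d_r/d_{j-1})2^{d_r+1}$ (the reading that agrees with the level-2 count for $F^{(d_3)}$ worked out above). The function names and the sample depth sequence are assumptions chosen for illustration.

```python
from fractions import Fraction


def level_io_counts(d):
    """Return n[(j, r)], the level-j I/O count for F^(d_r), for 2 <= j <= r <= k.

    d = [d_0, d_1, ..., d_k] is a depth sequence in which d_r is a multiple
    of d_{r-1}. Counts follow the stage-by-stage recursion in the text.
    """
    n = {}
    for r in range(2, len(d)):
        stages = Fraction(d[r], d[r - 1])      # d_r / d_{r-1} stages
        copies = 2 ** (d[r] - d[r - 1])        # copies of F^(d_{r-1}) per stage
        n[(r, r)] = stages * 2 ** (d[r] + 1)   # level-r operations
        for j in range(2, r):
            # every copy repeats the level-j work of F^(d_{r-1})
            n[(j, r)] = stages * copies * n[(j, r - 1)]
    return n


def closed_form(d, j, r):
    """Closed form (d_r / d_{j-1}) * 2^(d_r + 1), as read from the induction."""
    return Fraction(d[r], d[j - 1]) * 2 ** (d[r] + 1)


d = [1, 2, 6, 12]                              # hypothetical depth sequence
counts = level_io_counts(d)
for (j, r), value in sorted(counts.items()):
    print(j, r, value, value == closed_form(d, j, r))
```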

For $r > k$, the number of pebbles available at level $r$ or less is at least $2^d + 1$, which is enough to pebble $F^{(d)}$ by levels without performing I/O operations above level $k + 1$; this means that I/O operations at these levels are performed only on inputs, giving the bound $T_r^{(L)}(p, F^{(d)}, \mathcal{P}) = O(n)$, $n = 2^d$, for $k + 1 \le r \le L$. When $r \le k$, we pebble $F^{(d)}$ by decomposing it into $\lceil d/d_k \rceil$ stages such that each stage, except possibly the first, contains $2^{d-d_k}$ copies of the FFT subgraph $F^{(d_k)}$. The first stage has $2^{d-d^*}$ copies of $F^{(d^*)}$ of depth $d^* = d - (\lceil d/d_k \rceil - 1)d_k$, which we treat as subgraphs of the subgraph $F^{(d_k)}$ and pebble to completion with a number of operations at each level that is at most the number to pebble $F^{(d_k)}$. Each instance of $F^{(d_k)}$ is pebbled with $s_{k-1}$ pebbles at level $k - 1$ or lower and a pebble at level $k$ or higher is left on its output. Since $s_{k+1} \ge n + 1$, there are enough pebbles to do this.

Thus $T_l^{(L)}(p, F^{(d)}, \mathcal{P})$ satisfies the following bound for $1 \le l \le L$:

$$T_l^{(L)}(p, F^{(d)}, \mathcal{P}) \le \lceil d/d_k \rceil\, 2^{d-d_k}\, T_l^{(L)}(p, F^{(d_k)}, \mathcal{P})$$

Combining this with the earlier result, we have the following upper bound on the number of I/O operations for $1 \le l \le k$:

$$T_l^{(L)}(p, F^{(d)}, \mathcal{P}) \le \lceil d/d_k \rceil (d_k/d_{l-1})\, 2^{d+1}$$

Since, as noted earlier, $d_r \ge \log(\min(s_r - 1, n))/4$, we obtain the desired upper bound on $T_l^{(L)}(p, F^{(d)}, \mathcal{P})$ by combining this result with the bound on $n_l^{(k)}$ given above. ■


The above results are derived for the standard MHG and the family of FFT graphs. We now strengthen these results in two ways when the I/O-limited MHG is used. First, the I/O limitation requires more time for a given amount of storage and, second, the lower bound we derive applies to every graph for the discrete Fourier transform, not just those for the FFT.

It is important to note that the efficient pebbling strategy used in the standard MHG makes extensive use of level-$L$ pebbles on intermediate vertices of the FFT graph. When this is not allowed, the lower bound on the I/O time is much larger. Since the lower bounds for the standard and I/O-limited MHG on matrix multiplication are about the same, this illustrates that the DFT and matrix multiplication make dramatically different use of secondary memory. (In the following theorem a linear straight-line program is a straight-line program in which the operations are additions and multiplications by constants.)


THEOREM 11.5.6 Let $FFT(n)$ be any DAG associated with the DFT on $n$ inputs when realized by a linear straight-line program. Let $FFT(n)$ be pebbled with strategy $\mathcal{P}$ in the I/O-limited MHG with resource vector $p$ and let $s_l = \sum_{j=1}^{l} p_j$. If $S = s_{L-1} \le n$, then for each pebbling strategy $\mathcal{P}$, the computation and I/O time at level $l$ must satisfy the following bounds:

$$T_l^{(L)}(p, FFT(n), \mathcal{P}) = \Omega\!\left(\frac{n^2}{S}\right) \quad \text{for } 1 \le l \le L$$

Also, when $n = 2^d$, there is a pebbling $\mathcal{P}$ of the FFT graph $F^{(d)}$ such that the following relations hold simultaneously when $S \ge 2 \log n$:

$$T_l^{(L)}(p, F^{(d)}, \mathcal{P}) = \begin{cases} O\!\left(\dfrac{n^2}{S} + n \log S\right) & l = 1 \\ O\!\left(\dfrac{n^2}{S} + \dfrac{n \log S}{\log s_{l-1}}\right) & 2 \le l \le L \end{cases}$$

Proof The lower bound follows from Theorem 11.3.1 and Theorem 10.5.5. We show that the upper bounds can be achieved on $F^{(d)}$ under the I/O limitation simultaneously for $1 \le l \le L$.

The pebbling strategy meeting the lower bounds is based on that used in the proof of Theorem 10.5.5 to pebble $F^{(d)}$ using $S \le 2^d + 1$ pebbles in the red pebble game. The number of level-1 pebble placements used in that pebbling is given in the statement of Theorem 10.5.5. A level-2 I/O operation occurs once on each of the $2^d$ outputs and $2^{d-e}$ times on each of the $2^d$ inputs of the bottom FFT subgraphs, for a total of $2^d(2^{d-e} + 1)$ times.

The pebbling for the $L$-level MHG is patterned after the aforementioned pebbling for the red pebble game, which is based on the decomposition of Lemma 6.7.4. (See Fig. 11.9.) Let $e$ be the largest integer such that $S \ge 2^e + d - e$. Pebble the binary subtrees on $2^{d-e}$ inputs in the $2^e$ bottom subgraphs $F^{(d-e)}$ as follows: On an input vertex level-$L$ pebbles are replaced by pebbles at all levels down to and including the first level. Then level-1 pebbles are advanced on the subtrees in the order that minimizes the number of level-1 pebbles in the red pebble game. It may be necessary to use pebbles at all levels to make these advances; however, each vertex in a subtree (of which there are $2^{d-e+1} - 1$) experiences at most two transitions at each level in the hierarchy. In addition, each vertex in a bottom tree is pebbled once with a level-1 pebble in a computation step. Therefore, the number of level-$l$ transitions on vertices in the subtrees is at most $2^{d+1}(2^{d-e+1} - 1)$ for $2 \le l \le L$, since this pebbling of $2^e$ subtrees is repeated $2^{d-e}$ times.

Once the inputs to a given subgraph $F^{(e)}$ have been pebbled, the subgraph itself is pebbled in the manner indicated in Theorem 11.5.5, using $O(e 2^e/\log s_{l-1})$ transitions at each level $l$ for $2 \le l \le L$. Since this is done for each of the $2^{d-e}$ subgraphs $F^{(e)}$, it follows that on the top FFT subgraphs a total of $O(e 2^d/\log s_{l-1})$ level-$l$ transitions occur, $2 \le l \le L$. In addition, each vertex in a graph $F^{(e)}$ is pebbled once with a level-1 pebble in a computation step.

It follows that at most

$$T_l^{(L)}(p, F^{(d)}, \mathcal{P}) = O\!\left(2^d(2^{d-e+1} - 1) + \frac{e 2^d}{\log s_{l-1}}\right)$$

level-$l$ I/O operations occur for $2 \le l \le L$, as well as

$$T_1^{(L)}(p, F^{(d)}, \mathcal{P}) = O\!\left(2^d(2^{d-e+1} - 1) + e 2^d\right)$$

computation steps. It is left to the reader to verify that $2^e \le 2^e + d - e \le S < 2^{e+1} + d - e - 1 \le 4 \cdot 2^e$ when $e + 1 \ge \log d$ (this is implied by $S \ge 2d$), from which the result follows. ■
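The inequality chain left to the reader can at least be spot-checked numerically. The following sketch computes $e$ for given $S$ and $d$ and tests the chain over a range of values with $2d \le S \le 2^d$; it is a sanity check under those assumed ranges, not a proof, and the function names are illustrative.

```python
def largest_e(S, d):
    """Largest integer e in 0..d with S >= 2**e + d - e, or None if there is none."""
    best = None
    for e in range(d + 1):
        if S >= 2 ** e + d - e:     # 2**e + d - e is non-decreasing in e
            best = e
    return best


def chain_holds(S, d):
    """Check 2^e <= 2^e + d - e <= S < 2^(e+1) + d - e - 1 <= 4 * 2^e."""
    e = largest_e(S, d)
    if e is None:
        return False
    return (2 ** e <= 2 ** e + d - e <= S
            < 2 ** (e + 1) + d - e - 1 <= 4 * 2 ** e)


d, S = 10, 64                        # n = 2^10 inputs, 64 pebbles below level L
e = largest_e(S, d)
print(e, chain_holds(S, d))          # 5 True
# Computation-step expression from the proof, without the hidden constant:
print(2 ** d * (2 ** (d - e + 1) - 1) + e * 2 ** d)
```

Since $2^e \le S \le 4 \cdot 2^e$, the term $2^d \cdot 2^{d-e+1}$ is within a constant factor of $n^2/S$ and $e 2^d$ is within a constant factor of $n \log S$, which is how the expressions in the proof turn into bounds in terms of $n$ and $S$.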


11.5.4  Convolution 

The convolution function $f_{\text{conv}}^{(n,m)} : \mathcal{R}^{n+m} \mapsto \mathcal{R}^{n+m-1}$ over a commutative ring $\mathcal{R}$ (see Section 6.7.4) maps an $n$-tuple $a$ and an $m$-tuple $b$ onto an $(n + m - 1)$-tuple $c$ and is denoted $c = a \otimes b$. An efficient straight-line program for the convolution is described in Section 6.7.4 that uses the convolution theorem (Theorem 6.7.2) and the FFT algorithm. The convolution theorem in terms of the $2n$-point DFT and its inverse is

$$a \otimes b = F_{2n}^{-1}(F_{2n}(a) \times F_{2n}(b))$$

Obviously, when $n = 2^d$ the $2n$-point DFT can be realized by the $2n$-point FFT. The DAG associated with this algorithm, shown in Fig. 11.11 for $d = 4$, contains three copies of the FFT graph $F^{(d+1)}$.
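As a concrete illustration of the convolution theorem, the sketch below convolves two integer $n$-tuples by padding them to length $2n$, transforming, multiplying componentwise, and inverting. It makes several assumptions not taken from the text: the ring is the complex numbers, a naive $O(N^2)$ DFT stands in for the FFT to keep the code short, the sign convention of the transform is the common one, and results are rounded back to integers.

```python
import cmath


def dft(x, inverse=False):
    """Naive O(N^2) discrete Fourier transform of a list x (stand-in for the FFT)."""
    N = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(N):
        acc = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * t * k / N)
                  for t in range(N))
        out.append(acc / N if inverse else acc)
    return out


def convolve_via_dft(a, b):
    """Convolve two integer n-tuples using the 2n-point DFT and its inverse."""
    n = len(a)
    if len(b) != n:
        raise ValueError("this sketch assumes two tuples of equal length")
    pad = [0] * n
    fa, fb = dft(list(a) + pad), dft(list(b) + pad)   # the two bottom transforms
    product = [u * v for u, v in zip(fa, fb)]         # componentwise product
    c = dft(product, inverse=True)                    # the top (inverse) transform
    return [round(v.real) for v in c[:2 * n - 1]]


def convolve_direct(a, b):
    """Direct O(n^2) convolution, used only to cross-check the result."""
    c = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            c[i + j] += ai * bj
    return c


print(convolve_via_dft([1, 2, 3, 4], [5, 6, 7, 8]))
print(convolve_direct([1, 2, 3, 4], [5, 6, 7, 8]))
```

The two forward transforms and the one inverse transform correspond to the three transform graphs in the DAG described above.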

We  derive  bounds  on  the  computation  and  I/O  time  in  the  standard  and  I/O-limited 
memory-hierarchy  game  needed  for  the  convolution  function  using  this  straight-line  program. 
For  the  standard  MHG,  we  invoke  the  lower  bounds  and  an  efficient  algorithm  for  the  FFT. 
For  the  I/O-limited  MHG,  we  derive  new  lower  bounds  based  on  those  for  two  back-to-back 
FFT  graphs  as  well  as  upper  bounds  based  on  the  I/O-limited  pebbling  algorithm  given  in 
Theorem 11.5.4 for FFT graphs.

THEOREM 11.5.7 Let $G_{\text{convolve}}$ be the graph of a straight-line program for the convolution of two $n$-tuples using the convolution theorem, $n = 2^d$. Let $G_{\text{convolve}}$ be pebbled in the standard MHG with the resource vector $p$. Let $s_l = \sum_{j=1}^{l} p_j$ and let $k$ be the largest integer such that $s_k \le n$. When $p_1 \ge 3$ there is a pebbling of $G_{\text{convolve}}$ for which the following bounds hold simultaneously:

$$T_l^{(L)}(p, G_{\text{convolve}}) = \begin{cases} \Theta(n \log n) & l = 1 \\ \Theta\!\left(\dfrac{n \log n}{\log s_{l-1}}\right) & 2 \le l \le k + 1 \\ \Theta(n) & k + 2 \le l \le L \end{cases}$$

Proof The lower bound follows from Lemma 11.3.2 and Theorem 11.5.5. From the former, it is sufficient to derive lower bounds for a subgraph of a graph. Since $F^{(d)}$ is contained in $G_{\text{convolve}}$, the lower bound follows.

Figure 11.11 A DAG for the graph of the convolution theorem on $n = 8$ inputs.

The upper bound follows from Theorem 11.5.5. We advance level-$L$ pebbles to the outputs of each of the two bottom FFT graphs $F^{(d+1)}$ in Fig. 11.11 and then pebble the top FFT graph. The number of I/O and computation steps used is triple that used to pebble one such FFT graph. In addition, we perform $O(n)$ I/O and computation steps to combine inputs to the top FFT graph. ■

The  bounds  for  the  I/O-limited  version  of  the  MHG  for  the  convolution  problem  are 
considerably  larger  than  those  for  the  standard  MHG.  They  have  a  much  stronger  dependence 
on  S  and  n  than  do  those  for  the  FFT  graph. 

THEOREM 11.5.8 Let $G_{\text{convolve}}$ be the graph of any DAG for the convolution of two $n$-tuples using the convolution theorem, $n = 2^d$. Let $G_{\text{convolve}}$ be pebbled in the I/O-limited MHG with the resource vector $p$ and let $s_l = \sum_{j=1}^{l} p_j$. If $S = s_{L-1} \le n$, then the time to pebble $G_{\text{convolve}}$ at the $l$th level, $T_l^{(L)}(p, G_{\text{convolve}})$, satisfies the following lower bounds simultaneously for $1 \le l \le L$:

$$T_l^{(L)}(p, G_{\text{convolve}}) = \Omega\!\left(\frac{n^3}{S^2}\right)$$

when $S \le n/\log n$.

Proof A lower bound is derived for this problem by considering a generalization of the graph shown in Fig. 11.11 in which the three copies of the FFT graph $F^{(d+1)}$ are replaced by an arbitrary DAG for the DFT. This could in principle yield a smaller lower bound on the time to pebble the graph. We then invoke Lemma 11.3.2 to show that a lower bound can be derived from a reduction of this new graph, namely, that consisting of two back-to-back DFT graphs obtained by deleting one of the bottom FFT graphs. We then derive a lower bound on the time to pebble this graph with the red pebble game and use it together with Theorem 11.3.1 to derive the lower bounds mentioned above.

Consider pebbling two back-to-back DAGs for the DFT on $n$ inputs, $n$ even, in the red pebble game. From Lemma 10.5.4, the $n$-point DFT function is $(2, n, n, n/2)$-independent. From the definition of the independence property (see Definition 10.4.2), we know that during a time interval in which $2(S + 1)$ of the $n$ outputs of the second DFT DAG on $n$ inputs are pebbled, at least $n/2 - 2(S + 1)$ of its inputs are pebbled. In a back-to-back DFT graph these inputs are also outputs of the first DFT graph. It follows that for each group of $2(S + 1)$ of these $n/2 - 2(S + 1)$ outputs of the first DFT DAG, at least $n/2 - 2(S + 1)$ of its inputs are pebbled. Thus, to pebble a group of $2(S + 1)$ outputs of the second DFT DAG (of which there are at least $\lfloor n/(2(S + 1)) \rfloor$ groups), at least $\lfloor (n/2 - 2(S + 1))/(2(S + 1)) \rfloor (n/2 - 2(S + 1))$ inputs of the first DFT must be pebbled. Thus, the time to pebble this graph in the red pebble game is at least $n^3/(64(S + 1)^2)$, since this bound holds both when $S \le n/(4\sqrt{2})$ and when $S > n/(4\sqrt{2})$.

Let us now consider a pebbling strategy that achieves this lower bound up to a multiplicative constant. The pebbling strategy of Theorem 11.5.5 can be used for this problem. It represents the FFT graph $F^{(d)}$ as a set of FFT graphs $F^{(e)}$ on top and a set of FFT graphs $F^{(d-e)}$ on the bottom. Outputs of one copy of $F^{(e)}$ are pebbled from left to right. This requires pebbling inputs of $F^{(d)}$ from left to right once. To pebble all outputs of $F^{(d)}$, $2^{d-e}$ copies of $F^{(e)}$ are pebbled and the $2^d$ inputs to $F^{(d)}$ are pebbled $2^{d-e}$ times.

Figure 11.12 An I/O-limited pebbling of a DAG for the convolution theorem showing the placement of eight pebbles.


Consider the graph $G_{\text{convolve}}$ consisting of three copies of $F^{(d)}$, two on the bottom and one on top, as shown in Fig. 11.12. Using the above strategy, we pebble the outputs of the two bottom copies of $F^{(d)}$ from left to right in parallel a total of $2^{d-e}$ times. The outputs of these two graphs are pebbled in synchrony with the pebbling of the top copy of $F^{(d)}$. It follows that the number of I/O and computation steps used on the bottom copies of $F^{(d)}$ in $G_{\text{convolve}}$ is $2(2^{d-e})$ times the number on one copy, with twice as many pebbles at each level plus the number of such steps on the top copy of $F^{(d)}$. It follows that $G_{\text{convolve}}$ can be pebbled with three times the number of pebbles at each level as can $F^{(d)}$, with $O(2^{d-e})$ times as many steps at each level. The conclusion of the theorem follows from manipulation of terms. ■

The  bounds  given  above  also  apply  to  some  permutation  and  merging  networks.  Since, 
as  shown  in  Section  6.8,  the  graph  of  Batcher’s  bitonic  merging  network  is  an  FFT  graph, 
the  bounds  on  I/O  and  computation  time  given  earlier  for  the  FFT  also  apply  to  it.  Also,  as 
shown  in  Section  7.8.2,  since  a  permutation  network  can  be  constructed  of  two  FFT  graphs 
connected  back-to-back,  the  lower  bounds  for  convolution  apply  to  this  graph.  (See  the  proofs 
of  Theorems  11.5.7  and  11.5.8.)  The  same  order-of-magnitude  upper  bounds  follow  from 
constructions  that  differ  only  in  details  from  those  given  in  these  theorems. 


11.6  Block  I/O  in  the  MHG 


Figure 11.13 A disk unit with three platters and two heads per disk. Each track is divided into four sectors and heads move in and out on a common arm. The memory of the disk controller holds the contents of one track on one disk.

Many memory units move data in large blocks, not in individual words, as generally assumed in the above sections. (Note, however, that one pebble can carry a block of data.) Data is moved in blocks because the time to fetch one word and a block of words is typically about the same. Figure 11.13 suggests why this is so. A disk spinning at 3,600 rpm that has 40 sectors per track and 512 bits per sector (its block size) requires about 10 msec to find data in the track under the head. However, the time to read one sector of 64 bytes (512 bits) is just 0.42 msec.
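The figures quoted for the disk follow from simple arithmetic, reproduced in the sketch below. The interpretation of the roughly 10 msec search time as being on the order of half a rotation is an assumption made here for illustration; the text gives only the rounded figure.

```python
rpm = 3600
sectors_per_track = 40

rotation_ms = 60_000 / rpm                    # one full revolution, in msec
sector_ms = rotation_ms / sectors_per_track   # time for one sector to pass the head
half_rotation_ms = rotation_ms / 2            # assumed: average wait for a sector

print(f"rotation:      {rotation_ms:.2f} msec")       # 16.67 msec
print(f"half rotation: {half_rotation_ms:.2f} msec")  # 8.33 msec
print(f"one sector:    {sector_ms:.2f} msec")         # 0.42 msec
```

The ratio between the time to reach the data and the time to transfer one sector is therefore roughly 20 to 1, which is what makes block transfers attractive.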

To model this phenomenon, we assume that the time to access $k$ disk sectors with consecutive addresses is $\alpha + k\beta$, where $\alpha$ is a large constant and $\beta$ is a small one. (This topic is also discussed in Section 7.3.) Given the ratio of $\alpha$ to $\beta$, it makes sense to move data to and from a disk in blocks of size about equal to the number of bytes on a track. Some operating systems move data in track-sized blocks, whereas others move them in smaller units, relying upon the fact that a disk controller typically keeps the contents of its current track in a fast random-access memory so that successive sector accesses can be done quickly.

The  gross  characteristics  of  disks  described  by  the  above  assumption  hold  for  other  storage 
devices  as  well,  although  the  relative  values  of  the  constants  differ.  For  example,  in  the  case  of  a 
tape  unit,  advancing  the  tape  head  to  the  first  word  in  a  consecutive  sequence  of  words  usually 
takes  a  long  time,  but  successive  words  can  be  read  relatively  quickly. 

The situation with interleaved random-access memory is similar, although the physical arrangement of memory is radically different. As depicted in Fig. 11.14, an interleaved random-access memory is a collection of $2^r$ memory modules, $r \ge 1$, each containing $2^k$ $b$-bit words. Such a memory can simulate a single $2^{r+k}$-word $b$-bit random-access memory. Words with addresses $0, 2^r, 2 \cdot 2^r, 3 \cdot 2^r, \ldots, (2^k - 1)2^r$ are stored in the first module, words with addresses $1, 2^r + 1, 2 \cdot 2^r + 1, 3 \cdot 2^r + 1, \ldots, (2^k - 1)2^r + 1$ in the second module, and words with addresses $2^r - 1, 2 \cdot 2^r - 1, 3 \cdot 2^r - 1, 4 \cdot 2^r - 1, \ldots, 2^{r+k} - 1$ in the last module.

To  access  a  word  in  this  memory,  the  high  order  k  bits  are  provided  to  each  module.  If 
a  set  of  words  is  to  be  read,  the  words  with  these  common  high-order  bits  are  copied  to  the 
registers.  If  a  set  of  words  is  to  be  written,  new  values  are  copied  from  the  registers  to  them. 
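The address layout of an interleaved memory amounts to splitting an $(r + k)$-bit address into its low-order $r$ bits (the module) and its high-order $k$ bits (the position within the module). A minimal sketch, with function names chosen here for illustration:

```python
def locate(address: int, r: int):
    """Split an address into (module, offset) for 2**r interleaved modules."""
    module = address & ((1 << r) - 1)   # low-order r bits select the module
    offset = address >> r               # high-order k bits select the word in it
    return module, offset


def words_in_module(module: int, r: int, k: int):
    """All addresses stored in one module when each module holds 2**k words."""
    return [module + i * 2 ** r for i in range(2 ** k)]


r, k = 3, 2                              # eight modules of four words each
print(words_in_module(0, r, k))          # [0, 8, 16, 24]
print(words_in_module(2 ** r - 1, r, k)) # [7, 15, 23, 31]

# Words sharing the same high-order bits sit at the same offset in every
# module, so one parallel access fetches 2**r consecutive addresses.
print([locate(a, r) for a in range(8, 16)])
```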

When an interleaved memory is used to simulate a much faster random-access memory, a CPU writes to or reads from the $2^r$ registers serially, whereas data is transferred in parallel between the registers and the modules. The use of two sets of registers (double buffering) allows the register sets to be alternated so that data can be moved continuously between the CPU and the modules. This allows the interleaved memory to be about $2^r$ times slower than the CPU and yet, with a small set of fast registers, appear to be as fast as the CPU. This works only if the program accessing memory does not branch to a new set of words. If it does, the startup time to access a new word is about $2^r$ times the CPU speed. Thus, an interleaved random-access memory also requires time of the form $\alpha + k\beta$ to access $k$ words. For example, for a moderately fast random-access chip technology $\alpha$ might be 80 nanoseconds whereas $\beta$ might be 10 nanoseconds, a ratio of 8 to 1.

Figure 11.14 Eight interleaved memory modules with double buffering. Addresses are supplied in parallel while data is pipelined into and out of the memory.

This discussion justifies assuming that the time to move $k$ words with consecutive addresses to and from the $l$th unit in the memory hierarchy is $\alpha_l + k\beta_l$ for positive constants $\alpha_l$ and $\beta_l$, where $\alpha_l$ is typically much larger than $\beta_l$. If $k = b_l = \lceil \alpha_l/\beta_l \rceil$, then $\alpha_l + k\beta_l \approx 2\alpha_l$ and the time to retrieve one item and $b_l$ items is about the same. Thus, efficiency dictates that items should be fetched in blocks, especially if all or most of the items in a block can be used if one of them is used. This justifies the block-I/O model described below. Here we let $t_l$ be the time to move a block at level $l$. We add the requirement that data stored together be retrieved together to reflect physical constraints existing in practice.
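The choice of block size can be illustrated with the 80 ns and 10 ns figures mentioned earlier for interleaved memory. The sketch below evaluates the linear access-time model at a single word and at a block of $\lceil \alpha/\beta \rceil$ words; the function names are illustrative.

```python
from math import ceil


def access_time(k: int, alpha: float, beta: float) -> float:
    """Time to move k words with consecutive addresses: alpha + k * beta."""
    return alpha + k * beta


def block_size(alpha: float, beta: float) -> int:
    """Block size b = ceil(alpha / beta), at which a block costs about 2 * alpha."""
    return ceil(alpha / beta)


alpha, beta = 80, 10                   # nanoseconds, from the example in the text
b = block_size(alpha, beta)
print(b)                               # 8
print(access_time(1, alpha, beta))     # 90: one word
print(access_time(b, alpha, beta))     # 160: a whole block, about 2 * alpha
print(access_time(b, alpha, beta) / b) # 20.0: amortized cost per word in a block
```

Fetching a block costs less than twice as much as fetching one word, so the amortized cost per word falls sharply when most words in the block are actually used.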

DEFINITION 11.6.1 (Block-I/O Model) At the $l$th level in a memory hierarchy, I/O operations are performed on blocks. The block size and the time in seconds to access a block at the $l$th level are $b_l$ and $t_l$, respectively. For each $l$, $b_l/b_{l-1}$ is an integer. In addition, any data written as part of a block at level $l$ must be read into level $l - 1$ by reading the entire block in which it was stored.

The lower bounds on the number of I/O steps given in Section 11.5 can be generalized to the block-I/O case by dividing the number of I/O operations by the size $b_l$ of blocks moving between levels $l - 1$ and $l$. This lower bound can be achieved for matrix-vector and matrix-matrix multiplication because data is always written to and read from the higher-level memory in the same way for these problems. (See Problems 11.13 and 11.14.)

For the FFT graph in the standard MHG, instead of pebbling FFT subgraphs on $2^{d_r}$ inputs, we pebble $b_l$ FFT subgraphs on $2^{d_r}/b_l$ inputs (assuming that $b_l$ is a power of 2). Doing so allows all the data moving back and forth in blocks between memories to be used and accommodates the transposition mentioned at the beginning of Section 11.5.3. This provides an upper bound of $O(n \log n/(b_{l-1} \log(s_{l-1}/b_{l-1})))$ on the I/O time at level $l$.

Clearly, when $b_{l-1}$ is much smaller than $s_{l-1}$, say $b_{l-1} = O(\sqrt{s_{l-1}})$, the upper and lower bounds match to within a multiplicative factor. (This follows because we divide $n$ by $b_{l-1}$ and $\log b_{l-1} = O(\log s_{l-1})$.) These observations apply to the FFT-based problems as well.

11.7  Simulating  a  Fast  Memory  in  the  MHG 

In this section we revisit the discussion of Section 11.1.2, taking into account that a memory hierarchy may have many levels and that data is moved in blocks.

We  ask  the  question,  “How  do  we  assess  the  effectiveness  of  a  memory  hierarchy  on  a 
particular  problem?”  For  several  problems  we  have  upper  and  lower  bounds  on  their  number  of 
computation  and  I/O  steps  in  memory  hierarchies  parameterized  by  block  sizes  and  numbers  of 
storage  locations.  If  we  add  to  this  mix  the  time  to  move  a  block  between  levels,  we  can  derive 
bounds  on  the  time  for  all  computation  and  I/O  steps.  We  then  ask  under  what  conditions 
this  time  is  the  best  possible.  Since  data  must  typically  be  stored  and  retrieved  from  archival 
memory,  we  cannot  expect  the  performance  to  exceed  that  of  a  two-level  hierarchy  (modeled 
by  the  red-blue  pebble  game)  in  which  all  the  available  storage  locations,  except  for  those  in 
the  archival  memory,  are  in  first-level  storage.  For  this  reason  we  use  the  two-level  memory 
as  our  reference  model.  We  now  define  these  terms  and  state  a  condition  for  optimality  of  a 
pebbling  strategy. 

For $1 \le l \le L - 1$ we let $t_l$ be the time to move one block of words between levels $l - 1$ and $l$ of a memory hierarchy, measured as a multiple of the time to perform one computation step. Thus, the time for one computation step is $t_1 = 1$.

Let $\mathcal{P}$ be a pebbling strategy for a graph $G$ in the $L$-level MHG that uses the resource vector $p = (p_1, p_2, \ldots, p_{L-1})$ ($p_l$ pebbles are used at the $l$th level) and moves data in blocks of size specified by $b = (b_2, b_3, \ldots, b_L)$ ($b_l$ words are moved between levels $l - 1$ and $l$). Let $T_l^{(L)}(p, b, G)$ denote the number of level-$l$ I/O operations with $\mathcal{P}$ on $G$. We define the time for the pebbling strategy $\mathcal{P}$, $T(\mathcal{P}, G)$, on the graph $G$ as

$$T(\mathcal{P}, G) = \sum_{l=1}^{L} t_l \cdot T_l^{(L)}(p, b, G)$$

Thus, $T(\mathcal{P}, G)$ measures the absolute time expended to pebble a graph relative to the time to perform one computation step under the assumption that I/O operations cannot be overlapped.

From the above discussion, a pebbling is efficient if $T(\mathcal{P}, G)$ is at most some small multiple of $T_1^{(2)}(s_{L-1}, G)$, the normalized time to pebble $G$ in the red-blue pebble game when all the pebbles at level $L - 1$ or less in the MHG (there are $s_{L-1}$ such pebbles) are used as if they were red pebbles.

A two-level computation exhibits locality of reference if it is likely in the near future to refer to words currently in its primary memory. Such computations perform fewer I/O operations than those that don't meet this condition. This idea extends to multiple levels: a multi-level memory hierarchy exhibits locality of reference if it uses its higher-level memory units much less often than its lower-level units. Formally, we say that a pebbling strategy $\mathcal{P}$ is c-local if $T(\mathcal{P}, G)$ satisfies the following inequality:

$$\sum_{l=1}^{L} t_l \cdot T_l^{(L)}(p, b, G, \mathcal{P}) \le c\, T_1^{(2)}(s_{L-1}, G)$$
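The c-local condition is a direct comparison of a weighted sum against a reference time, which the sketch below makes explicit. The per-level block times, operation counts, and reference time are hypothetical numbers chosen only to exercise the definition; they are not derived from any graph in the text.

```python
def total_time(t, T):
    """T(P, G): sum over levels of (time per operation) * (number of operations).

    t[i] and T[i] refer to level i + 1; t[0] = 1 is one computation step.
    """
    return sum(tl * Tl for tl, Tl in zip(t, T))


def is_c_local(t, T, reference_time, c):
    """True if the weighted time is at most c times the two-level reference time."""
    return total_time(t, T) <= c * reference_time


# Hypothetical three-level hierarchy.
t = [1, 10, 1000]                 # relative cost of one operation at each level
T = [10**6, 2 * 10**4, 500]       # number of operations performed at each level
reference = 10**6                 # assumed two-level reference time

print(total_time(t, T))                  # 1700000
print(is_c_local(t, T, reference, 2))    # True
print(is_c_local(t, T, reference, 1.5))  # False
```

In this made-up instance the slow top level is used only 500 times, yet it contributes half a million time units, which shows why a c-local strategy must use higher levels far less often than lower ones.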

The definition of a c-local pebbling strategy is illustrated by the results for matrix multiplication in the standard MHG when block I/O is not used. Let $k$ be the largest integer such that $s_k \le 3n^2$. From Theorem 11.5.3 for matrix-matrix multiplication, we see that there exists an optimal pebbling if

$$\sum_{l=2}^{k} \frac{t_l}{b_l \sqrt{s_{l-1}}} + \sum_{l=k+1}^{L} \frac{t_l}{n b_l} \le c^* \qquad (11.1)$$

for some $c^* > 0$ since $T_1^{(2)}(s, G) = \Omega(n^3)$.

We noted in Section 11.1.2 that the imbalance between the computation and I/O times for matrix multiplication is becoming ever more serious with the advance of technology. We re-examine this issue in light of the above condition. Consider the case in which $k + 1 = L$; that is, the highest-level memory is used to store the arguments and results of a computation. In this case the second term on the left-hand side of (11.1) is a relative measure of the time to bring data into lower-level memories from the highest-level memory. It is negligible when $n b_L$ is large. For example, if $t_L = 2{,}000{,}000$ and $b_L = 10{,}000$, say, then $n$ must be at least 200, a modest-sized matrix. The first term on the left-hand side reflects the number of times data moves between the levels of the hierarchy holding the data. It is small when $b_l \sqrt{s_{l-1}}$ is large by comparison with $t_l$ for $2 \le l \le k$, a condition that is not hard to meet. For example, if $s_{l-1} = 32 \times 10^6$ (about 4 Mbytes) and $b_l = 1{,}000$, then $t_l$ must be less than about 45, a condition that certainly applies to low level memories such as today's random-access memories. Problems 11.15 and 11.16 provide opportunities to explore this issue with the FFT and convolution.


11.8 RAM-Based I/O Models

The MHG assumes that computations are done by pebbling the vertices of a directed acyclic graph. That is, it assumes that computations are straight-line. While the best known algorithms for the problems studied earlier in this chapter are straight-line, some problems are not efficiently done in a straight-line fashion. For example, binary search in a tree that holds a set of keys in sorted order (see Section 11.9.1) is much better suited to data-dependent computation of the kind allowed by an unrestricted RAM. Similarly, the merging of two sorted lists can be done more efficiently on a RAM than with a straight-line program. For this reason we consider RAM-based I/O models, specifically the block-transfer model and the hierarchical memory model.

11.8.1  The  Block-Transfer  Model 

The  block-transfer  model  is  a  two-level  I/O  model  that  generalizes  the  red-blue  pebble  game 
to RAM-based computations by allowing programs that are not straight-line.

DEFINITION 11.8.1 The block-transfer model (BTM) is a serial computer in which a CPU is
attached to an M-word primary memory and to a secondary memory of unlimited size that stores
words in blocks of size B. Words are moved in blocks between the memories, and words that leave
primary memory in one block must return in that block. An I/O operation is the movement of a
block to or from secondary memory. The I/O time with the BTM is the number of I/O operations.

The  secondary  memory  in  the  BTM  can  be  a  main  memory  if  the  primary  memory  is  a 
cache,  or  can  be  a  disk  if  the  primary  memory  is  a  random-access  memory.  In  fact,  it  can  model 
I/O  operations  between  any  two  devices.  Since  a  block  can  be  viewed  as  the  contents  of  one 
track  of  a  disk,  the  time  to  retrieve  any  word  on  the  track  is  comparable  to  the  time  to  retrieve 
the entire track. (See Section 11.6.) Since data is moved in blocks in the BTM, it makes sense
to  define  simple  I/O  operations. 

DEFINITION 11.8.2 An I/O operation in the BTM is simple if after a block or word is copied
from  one  memory  to  the  other,  the  copy  in  the  first  memory  is  deleted. 

Simple  I/O  operations  for  the  pebble  game  are  defined  in  Problem  11.10.  In  this  problem 
the  reader  is  asked  to  show  that  replacing  all  I/O  operations  with  simple  I/O  operations  has 
the  effect  of  at  most  doubling  the  number  of  I/O  operations.  The  proof  of  this  fact  applies 
equally  well  to  the  BTM. 

We  illustrate  the  use  of  the  block-transfer  model  by  examining  the  sorting  problem.  We 
derive  a  lower  bound  on  the  I/O  time  for  all  sorting  algorithms  and  exhibit  a  sorting  algorithm 
that  meets  the  lower  bound,  up  to  a  constant  multiplicative  factor.  To  derive  the  lower  bound, 
we  limit  the  range  of  sorting  algorithms  to  those  based  on  the  comparison  of  keys,  as  stated 
below.  (Sorting  algorithms  that  are  not  comparison-based,  such  as  the  various  forms  of  radix 
sort,  assume  that  keys  consist  of  individual  digits  and  that  digits  are  used  to  classify  keys.) 

ASSUMPTION 11.8.1 All words to be sorted are located initially in the secondary memory. The
compare-exchange  operation  is  the  only  operation  available  to  implement  sorting  algorithms  on 
the  BTM.  In  addition,  an  arbitrary  permutation  of  the  contents  of  the  primary  memory  of  the  BTM 
can  be  done  during  the  time  required  for  one  I/O  operation. 

The  assumption  that  the  CPU  can  perform  an  arbitrary  permutation  on  the  contents  of  the 
primary memory during one I/O operation acknowledges that I/O operations take a very long
time  relative  to  CPU  instructions. 

Algorithms  consistent  with  these  assumptions  are  described  by  the  multiway  decision  trees 
discussed  below.  They  are  a  generalization  of  the  binary  decision  tree,  a  binary  tree  in  which 
each vertex has associated with it a comparison between two variables. For example, if keys $x_1$
and $x_2$ are compared at the root vertex, the comparison has two outcomes, namely $x_1 < x_2$ or
$x_1 > x_2$, which are associated with the subtrees to the left and right of the root, respectively.
Similar  comparisons  and  outcomes  are  possible  at  each  vertex  of  these  two  subtrees.  A  sequence 
of  comparisons  terminates  on  a  leaf  node. 

Since a binary decision tree captures each of the data-dependent comparisons between keys
in a comparison-based sorting algorithm, each leaf is associated with the permutation of the
original sequence of variables that puts the sequence into sorted order. Thus, a binary decision
tree for sorting must have at least n! distinct leaves, one for every permutation of n items. The
length of a path through a binary decision tree is the number of comparisons performed on the
particular input, and the length of the longest path is a measure of the worst-case number of
comparisons. A binary tree with N leaves has a longest path of length at least $\log_2 N$ because
if it were smaller, it would have fewer than $2^{\log_2 N} = N$ leaves. Since the length of the longest
path is an integer, it must be at least $\lceil \log_2 N \rceil$. We summarize this result as a lemma that uses
the lower bound on n! given in Problem 2.23.

LEMMA 11.8.1 The length of the longest path in a binary decision tree that sorts n inputs is at
least $\lceil \log_2 n! \rceil = \Theta(n \log n)$.
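
As a small numerical illustration of Lemma 11.8.1, the following sketch evaluates $\lceil \log_2 n! \rceil$ exactly with integer arithmetic. The function name `sorting_depth_lower_bound` is hypothetical and is introduced only for this example.

```python
from math import factorial


def sorting_depth_lower_bound(n):
    """Return ceil(log2(n!)), the minimum worst-case number of comparisons.

    A binary decision tree that sorts n inputs needs at least n! leaves,
    so its longest path has length at least ceil(log2(n!)).
    """
    leaves = factorial(n)
    # For any integer N >= 1, (N - 1).bit_length() equals ceil(log2(N)).
    return (leaves - 1).bit_length()


if __name__ == "__main__":
    for n in (1, 3, 4, 8):
        print(n, sorting_depth_lower_bound(n))
```

Integer arithmetic is used here instead of floating-point logarithms so that the ceiling is exact even when n! is very large.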

The multiway decision tree in Fig. 11.15 extends the above concept by permitting multiple
comparisons at each vertex. Up to $2^k$ outcomes are possible if k comparisons of variable pairs are
associated with each vertex.


THEOREM 11.8.1 Let B divide M and M divide n. Under Assumption 11.8.1 on the BTM,
in the worst case the number of block I/O steps to sort a set of n records using M words of primary
memory and block size B, $T^{\mathrm{sort}}_{\mathrm{BTM}}(n)$, satisfies the following bounds for $B \le M/2$ and M
large:

$$T^{\mathrm{sort}}_{\mathrm{BTM}}(n) = \Theta\left(\max\left(\frac{n}{B},\ \frac{(n/B)\log(n/B)}{\log(M/B)}\right)\right)$$

Proof  Let’s  now  apply  the  multiway  decision  tree  to  the  BTM.  Since  each  path  in  such  a  tree 
corresponds  to  a  sequence  of  comparisons  by  the  CPU,  the  tree  must  have  at  least  n!  leaves. 
To  complete  the  lower-bound  derivation  we  need  to  determine  the  number  of  descendants 
of  vertices  in  the  multiway  tree. 

Initially the n unsorted words are stored in n/B blocks in the secondary memory. The
first time one of these blocks is moved to the primary memory, up to B! permutations
can be performed on the words in it. No more permutations are possible between these
words no matter how many times they are simultaneously in primary memory, even if they
return to the memory as members of different blocks. When a block of B words arrives in
the M-word memory, the number of possible permutations between them (given that the
order among the $M - B$ words originally in the memory has previously been taken into
account, as has the order among the B words in a block) is at most $p = \binom{M}{B}$, the binomial
coefficient. (To see this, observe that places for the B new (and indistinguishable) words in
the primary memory can be any B of the M indistinguishable places.) It follows that the
multi-comparison decision tree for every comparison-based sorting algorithm on the
BTM has at most n/B vertices with at most $pB!$ possible outcomes (vertices corresponding
to the first arrival of one of the blocks in primary memory) and that each of the other vertices
has at most p outcomes.

Figure 11.15 A multiway decision tree in which multiple comparisons of keys are made at each
vertex.

It follows that if a sorting algorithm executes $T^{\mathrm{sort}}_{\mathrm{BTM}}(n)$ block I/O steps, the function
$T^{\mathrm{sort}}_{\mathrm{BTM}}(n)$ must satisfy the following inequality:

$$(B!)^{n/B} \binom{M}{B}^{T^{\mathrm{sort}}_{\mathrm{BTM}}(n)} \ge n!$$

Using the approximation to n! given in Lemma 11.8.1, the upper bound of $(M/B)^B e^B$ on
$\binom{M}{B}$ derived in Lemma 10.12.1, and the fact that $T \ge n/B$, we have the desired conclusion.
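
The final step can be spelled out as follows. This is a sketch that assumes the standard estimates $n! \ge (n/e)^n$ and $B! \le B^B$, which may differ in constants from the bounds the text cites; $T$ abbreviates the number of block I/O steps.

```latex
\begin{align*}
\frac{n}{B}\log_2 B! \;+\; T \log_2 \binom{M}{B} &\ge \log_2 n!
  && \text{(take logarithms of the inequality)} \\
n \log_2 B \;+\; T\,B \log_2\!\left(\frac{eM}{B}\right) &\ge n \log_2\!\left(\frac{n}{e}\right)
  && \text{(apply the three estimates)} \\
T &\ge \frac{(n/B)\,\log_2\!\bigl(n/(eB)\bigr)}{\log_2(eM/B)}
\end{align*}
```

When $B \le M/2$ we have $\log_2(M/B) \ge 1$, so $\log_2(eM/B) \le (1 + \log_2 e)\log_2(M/B)$ and the right-hand side is $\Omega\bigl((n/B)\log(n/B)/\log(M/B)\bigr)$ once n/B is large. Combined with $T \ge n/B$, this gives the maximum of the two terms.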

An upper bound is obtained by extending the standard merging algorithm to blocks of
keys. The merging algorithm is divided into phases, an initialization phase and merging
phases, each of which takes 2n/B I/O operations. In the initialization phase, a set of
n/M sorted sublists of M keys or M/B blocks is formed by bringing groups of M keys into
primary memory, sorting, and then writing them out to secondary memory. In a merging
phase, M/B sorted sublists of L blocks (L = M/B in the first merging phase) are merged
into one sorted sublist of ML/B blocks, as suggested in Fig. 11.16. The first block of keys
(those with the smallest values) in each sublist is brought into memory and the B smallest
keys in this set are written out to the new sorted sublist that is being constructed. If any
block from an input sublist is depleted, the next block from that list is brought in. There
is always sufficient space in primary memory to do this. Thus, after k phases the sorted
sublists contain $(M/B)^k$ blocks. When $(M/B)^k \ge n/B$, the merging is done. Thus,
$(2n/B) \lceil \log_2(n/B) / \log_2(M/B) \rceil$ I/O operations are performed by this algorithm. ■
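
The phase structure of this merging algorithm can be sketched in Python as below. This is an illustrative simulation, not the book's procedure: the name `btm_merge_sort` is hypothetical, the in-memory block buffers are not modeled (Python's `heapq.merge` stands in for the (M/B)-way merge), and the I/O counter simply charges one read and one write per block in every phase, as the text describes.

```python
import heapq


def btm_merge_sort(keys, M, B):
    """Sort keys by run formation plus (M/B)-way merging.

    Returns (sorted_keys, io_count), where io_count charges each phase
    one block read and one block write for every block of B keys.
    Assumes B divides M, M divides len(keys), and B <= M/2.
    """
    n = len(keys)
    assert M % B == 0 and n % M == 0 and B <= M // 2
    io = 0

    # Initialization phase: form n/M sorted sublists of M keys each.
    runs = []
    for start in range(0, n, M):
        runs.append(sorted(keys[start:start + M]))
        io += 2 * (M // B)  # read M/B blocks, then write M/B blocks

    # Merging phases: merge groups of M/B sublists until one remains.
    fan_in = M // B
    while len(runs) > 1:
        merged_runs = []
        for start in range(0, len(runs), fan_in):
            group = runs[start:start + fan_in]
            merged = list(heapq.merge(*group))
            io += 2 * (len(merged) // B)  # each block is read and written once
            merged_runs.append(merged)
        runs = merged_runs

    return runs[0], io


if __name__ == "__main__":
    data = [(37 * i) % 64 for i in range(64)]  # a fixed permutation of 0..63
    result, io_count = btm_merge_sort(data, M=8, B=2)
    print(result == sorted(data), io_count)
```

With n = 64, M = 8, and B = 2 the sketch runs one initialization phase and two merging phases, each costing 2n/B = 64 I/O operations, which agrees with $(2n/B)\lceil \log_2(n/B)/\log_2(M/B) \rceil = 64 \cdot 3$.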


Figure 11.16 The state of the block merging algorithm after merging four blocks. The algorithm
merges M/B sublists, each containing L blocks of B keys.

Similar results can be obtained for the permutation networks defined in Section 7.8.2 (see
Problem 11.18), the FFT defined in Section 6.7.3 (see Problem 11.19), and matrix transposition
defined in Section 6.5.4 (see [9]).


11.9  The  Hierarchical  Memory  Model 

In  this  section  we  define  the  hierarchical  memory  model  and  derive  bounds  on  the  time  to  do 
matrix  multiplication,  the  FFT  and  binary  search  in  this  model.  These  results  provide  another 
opportunity  to  evaluate  the  performance  of  memory  hierarchies,  this  time  with  a  single  cost 
function  applied  to  memory  accesses  at  all  levels  of  a  hierarchy.  We  make  use  of  lower  bounds 
derived  earlier  in  this  chapter. 

DEFINITION 11.9.1 The hierarchical memory model (HMM) is a serial computer in which a
CPU without registers is attached to a random-access memory of unlimited size for which the time
to access location a for reading or writing is the value of a monotone nondecreasing cost function
$\nu(a) : \mathbb{N} \mapsto \mathbb{N}$ from the integers $\mathbb{N} = \{0, 1, 2, 3, \ldots\}$ to $\mathbb{N}$. The cost of computing
$f^{(n)} : \mathcal{A}^n \mapsto \mathcal{A}^m$ with the HMM using the cost function $\nu(a)$, $\mathcal{K}_\nu(f)$, is defined as

$$\mathcal{K}_\nu(f) = \max_{x} \sum_{j=1}^{T(x)} \nu(a_j) \qquad (11.2)$$

where $a_j$, $1 \le j \le T(x)$, is the address accessed by the CPU on the jth computational step and
T(x) is the number of steps when the input is x.
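
To make the cost measure concrete, the sketch below evaluates it on explicit address traces. All names (`hmm_cost`, `log_cost`, `threshold_cost`, `worst_case_cost`) are hypothetical, and the convention that the logarithmic cost is 0 for addresses 0 and 1 is an assumption made for this example.

```python
def hmm_cost(trace, nu):
    """Sum the access cost nu(a) over the addresses a touched in one run."""
    return sum(nu(a) for a in trace)


def log_cost(a):
    """Logarithmic cost ceil(log2 a); taken to be 0 for a <= 1 (assumption)."""
    return 0 if a <= 1 else (a - 1).bit_length()


def threshold_cost(m):
    """Return a cost function charging 1 for addresses a >= m and 0 otherwise."""
    return lambda a: 1 if a >= m else 0


def worst_case_cost(traces, nu):
    """Maximize the run cost over the traces produced by different inputs x."""
    return max(hmm_cost(trace, nu) for trace in traces)


if __name__ == "__main__":
    trace = [1, 2, 3, 4, 8, 9]
    print(hmm_cost(trace, log_cost))           # logarithmic cost of one run
    print(hmm_cost(trace, threshold_cost(4)))  # accesses at or beyond address 4
    print(worst_case_cost([[1, 2], [8, 9]], log_cost))
```

Each trace plays the role of the address sequence $a_1, \ldots, a_{T(x)}$ for one input x, and `worst_case_cost` corresponds to the maximization over x in the definition.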

The HMM with cost function $\nu(a) = 1$ is the standard random-access machine described
in Section 3.4. While in principle the HMM can model many of the details of the MHG, it
is more difficult to make explicit the dependence of $\nu(a)$ on the amount of memory at each
level in the hierarchy as well as the time for a memory access in seconds at that level. Even
though the HMM can model programs with branching and looping, following [7] we assume
straight-line programs when studying the FFT and matrix-matrix multiplication problems with
this model.

Let n(f, x, a) be the number of times that address a is accessed in the HMM for f on
input x. It follows that the cost $\mathcal{K}_\nu(f)$ can be expressed as follows:

$$\mathcal{K}_\nu(f) = \max_{x} \sum_{1 \le a} n(f, x, a)\,\nu(a) \qquad (11.3)$$

Many cost functions have been studied in the HMM, including $\nu(a) = \lceil \log_2 a \rceil$, $\nu(a) = a^\alpha$,
and $\nu(a) = U_m(a)$, where $U_m(a)$ is the following threshold function with threshold m:

$$U_m(a) = \begin{cases} 1 & a \ge m \\ 0 & \text{otherwise} \end{cases}$$

It follows that

$$\mathcal{K}_{U_m}(f) = \max_{x} \sum_{m \le a} n(f, x, a)$$

For the matrix-matrix multiplication and FFT problems, the cost $\mathcal{K}_{U_m}(f)$ of computing f is
directly related to the number of I/O operations with the red-blue pebble game played with
S = m red pebbles discussed in Sections 11.5.2 and 11.5.3. For this reason we call this cost
I/O complexity. The principal difference is that in the HMM no cost is assessed for data
stored in the first m memory locations.

Let the differential cost function $\Delta\nu(a)$ be defined as

$$\Delta\nu(a) = \nu(a) - \nu(a - 1)$$

As a consequence, we can write $\nu(a)$ as follows if we set $\nu(-1) = 0$:

$$\nu(a) = \sum_{0 \le b \le a} \Delta\nu(b) \qquad (11.4)$$

Since $\nu(a)$ is a monotone nondecreasing function, $\Delta\nu(a)$ is nonnegative.
Rewriting (11.3) using (11.4), we have

$$\mathcal{K}_\nu(f) = \max_{x} \sum_{1 \le a} n(f, x, a) \sum_{0 \le b \le a} \Delta\nu(b)
= \max_{x} \sum_{c=0}^{\infty} \Delta\nu(c) \sum_{a=c}^{\infty} n(f, x, a) \qquad (11.5)$$

11.9.1 Lower Bounds for the HMM

Before  deriving  bounds  on  the  cost  to  do  a  variety  of  tasks  in  the  HMM,  we  introduce  the 
binary  search  problem. 

A binary tree is a tree in which each vertex has either one or two descendants except leaf
vertices, which have none. (See Fig. 11.17.) Also, every vertex except the root vertex has one
parent vertex. The length of a path in a tree is the number of edges on that path. The
left (right) subtree of a vertex is the subtree that is detached by removing the left (right)
descending edge. A binary search tree is a binary tree that has one key at each vertex. (This
definition assumes that all the keys in the tree are distinct.) The value of this one key is larger
than that of all keys in the left subtree, if any, and smaller than all keys in the right subtree, if
any. A balanced binary search tree is a binary search tree in which all root-to-leaf paths have
length k or k + 1 for some integer k.

Figure 11.17 A binary search tree.

LEMMA 11.9.1 The length of the longest path in a binary tree with n vertices is at least
$\lceil \log_2((n+1)/2) \rceil$.

Proof A longest path in a binary tree with n vertices is smallest when all levels in the tree
are full except possibly for the bottom level. If such a tree has a longest path of length l, it
has between $2^l$ and $2^{l+1} - 1$ vertices. It follows that the longest path in a binary search tree
containing n keys is at least $\lceil \log_2((n+1)/2) \rceil$. ■

The  binary  search  procedure  searches  a  binary  search  tree  for  a  key  value  v.  It  compares 
v  against  the  root  value,  stopping  if  they  are  equal.  If  they  are  not  equal  and  v  is  less  than  the 
key  at  the  root,  the  search  resumes  at  the  root  vertex  of  the  left  subtree.  Otherwise,  it  resumes 
at  the  root  of  the  right  subtree.  The  procedure  also  stops  when  a  leaf  vertex  is  reached. 
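
The search procedure just described can be sketched as follows. The `Node` class and the helpers `build_balanced` and `bst_search` are hypothetical names introduced for illustration; the sketch stops when it runs past a leaf, which is one way to realize "stops when a leaf vertex is reached".

```python
class Node:
    """A vertex of a binary search tree holding one key."""

    def __init__(self, key, left=None, right=None):
        self.key = key
        self.left = left
        self.right = right


def build_balanced(sorted_keys):
    """Build a balanced binary search tree from distinct keys in sorted order."""
    if not sorted_keys:
        return None
    mid = len(sorted_keys) // 2
    return Node(sorted_keys[mid],
                build_balanced(sorted_keys[:mid]),
                build_balanced(sorted_keys[mid + 1:]))


def bst_search(root, v):
    """Search for v; return (found, number of vertices visited)."""
    node, visited = root, 0
    while node is not None:
        visited += 1
        if v == node.key:
            return True, visited
        # Continue in the left subtree if v is smaller, else in the right one.
        node = node.left if v < node.key else node.right
    return False, visited


if __name__ == "__main__":
    tree = build_balanced([1, 2, 3, 4, 5, 6, 7])
    print(bst_search(tree, 5))
    print(bst_search(tree, 8))
```

The number of vertices visited is the quantity that matters in the HMM, since each visited vertex costs one memory access whose price depends on where the vertex is stored.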

We can now state bounds on the cost on the HMM for the logarithmic cost function
$\nu(a) = \lceil \log_2 a \rceil$. This function applies when the memory hierarchy is organized as a binary
tree in which the low-indexed memory locations are located closest to the root and the time
to  retrieve  an  item  is  proportional  to  the  number  of  edges  between  it  and  the  root.  We  use  it 
to  illustrate  the  techniques  developed  in  the  previous  section. 

Theorem  11.9.1  states  lower  performance  bounds  for  straight-line  algorithms.  Thus,  the 
computation  time  is  independent  of  the  particular  argument  of  the  function  /  provided  as 
input.  Matching  upper  bounds  are  derived  in  the  following  section.  (The  logarithmic  cost 
function  is  polynomially  bounded.) 

THEOREM 11.9.1 Under the cost function $\nu(a) = \lceil \log_2 a \rceil$, the cost on the HMM of the n × n matrix
multiplication function $f^{(n)}_{A \times B}$ realized by the classical algorithm, the n-point FFT associated with
the graph $F^{(d)}$, $n = 2^d$, comparison-based sorting on n keys, $f^{(n)}_{\mathrm{sort}}$, and binary search on n keys,
$f^{(n)}_{\mathrm{BS}}$, satisfies the following lower bounds:

Matrix multiplication: $\mathcal{K}_\nu(f^{(n)}_{A \times B}) = \Omega(n^3)$
Fast Fourier transform: $\mathcal{K}_\nu(F^{(d)}) = \Omega(n \log n \log\log n)$
Comparison-based sorting: $\mathcal{K}_\nu(f^{(n)}_{\mathrm{sort}}) = \Omega(n \log n \log\log n)$
Binary search: $\mathcal{K}_\nu(f^{(n)}_{\mathrm{BS}}) = \Omega(\log^2 n)$


Proof The lower bounds for the logarithmic cost function $\nu(a) = \lceil \log_2 a \rceil$ use the fact
that $\Delta\nu(a) = 1$ when $a = 2^k$ for some integer k but is otherwise 0. It follows from (11.5)
that

$$\mathcal{K}_\nu(f) = \sum_{k=1}^{t} \mathcal{K}_{U_{2^k}}(f) \qquad (11.6)$$

for the task characterized by f, where t satisfies $2^t \le N$ and N is the space used by the task.
$N = 2n^2$ for n × n matrix multiplication, N = n for the FFT graph $F^{(d)}$, and N = n for
binary search.

In Theorem 11.5.3 it was shown that the number of I/O operations to perform n × n
matrix multiplication with the classical algorithm is $\Omega(n^3/\sqrt{m})$. The model of this theorem
assumes that none of the inputs are in the primary memory, the equivalent of the first m
memory locations in the HMM.

Since no charge is assessed by the $U_m(a)$ cost function for data in the first m memory
locations, a lower bound on cost with this measure can be obtained from the lower bound
obtained with the red-blue pebble game by subtracting m to take into account the first m
I/O operations that need not be performed.

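Thus for matrix multiplication, $\mathcal{K}_{U_m}(f^{(n)}_{A \times B}) = \Omega((n^3/\sqrt{m}) - m)$. Since

$$(n^3/\sqrt{m}) - m \ge (\sqrt{8} - 1)\, n^3 / \sqrt{8m}$$

when $m \le n^2/2$, it follows from (11.6) that $\mathcal{K}_\nu(f^{(n)}_{A \times B}) = \Omega(n^3)$ because
$\sum_{k=0}^{t} n^3/2^{k/2} = \Omega(n^3)$.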

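For the same reason, $\mathcal{K}_{U_m}(F^{(d)}) = \Omega((n \log n)/\log m - m)$ (see Theorem 11.5.5)
and $(n \log n/\log m) - m \ge n \log n/(2 \log m)$ for $m \le n/2$. It follows that $\mathcal{K}_\nu(F^{(d)})$
satisfies

$$\mathcal{K}_\nu(F^{(d)}) = \Omega\left(\sum_{k=1}^{\log(n/2)} \frac{n \log n}{2k}\right) = \Omega(n \log n \log\log n)$$

The last equation follows from the observation that $\sum_{k=1}^{p} 1/k$ is closely approximated by
$\int_1^p \frac{1}{x}\,dx$, which is ln p. (See Problem 11.2.)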

The lower bound for comparison-based sorting uses the $\Omega(n \log n/\log m)$ sorting lower
bound for the BTM with a block size B = 1. Since the BTM assumes that no data are resident
in the primary memory before a computation begins, the lower bound for the HMM
cost under the $U_m$ cost function is $\Omega((n \log n/\log m) - m)$. Thus, the FFT lower bound
applies in this case as well.

Finally, we show that the lower bound for binary search is $\mathcal{K}_{U_m}(f^{(n)}_{\mathrm{BS}}) = \Omega(\log n -
\log m)$. Each path in the balanced binary search tree has length $d = \lceil \log_2((n+1)/2) \rceil$ or
$d - 1$. Choose a query path that visits the minimum number of variables located in the first
m memory locations. To make this minimum number as large as possible, place the items
in the first m memory locations as close to the root as possible. They will form a balanced
binary subtree of path length $l = \lceil \log_2((m+1)/2) \rceil$ or $l - 1$. Thus no full path will have
more than l edges and $l - 1$ variables from the first m memory locations. It follows that
there is a path containing at least $d - 1 - (l - 1) = d - l = \lceil \log_2(n+1) \rceil - \lceil \log_2(m+1) \rceil$
variables that are not in the first m memory locations. At least one I/O operation is needed
per variable to operate on them. It thus follows that

$$\mathcal{K}_\nu(f^{(n)}_{\mathrm{BS}}) = \sum_{d=0}^{\log n} \Omega(\log n - \log(2^d)) = \sum_{d=0}^{\log n} \Omega(\log n - d) = \Omega(\log^2 n)$$

The last equality is a consequence of the fact that $\log n - d$ is greater than $(\log n)/2$ for
$d < (\log n)/2$. ■

Lower bounds on the I/O complexity for these problems can be derived for a large variety
of cost functions. The reader is asked in Problem 11.20 to derive such bounds for the cost
function $\nu(a) = a^\alpha$.

11.9.2 Upper Bounds for the HMM

A natural question in this context is whether these lower bounds can be achieved. We already
know from Theorems 11.5.3 and 11.5.5 that for each allocation of memory to each
memory-hierarchy level, it is possible to match upper and lower bounds on the number of I/O
operations and computation time. As a consequence, for each of these problems near-optimal
solutions exist for any cost function on memory accesses.

11.10  Competitive  Memory  Management 

The  results  stated  above  for  the  hierarchical  memory  model  assume  that  the  user  has  explicit 
control  over  the  location  of  data,  an  assumption  that  does  not  apply  if  storage  is  allocated  by  an 
operating  system.  In  this  section  we  examine  memory  management  by  an  operating  system 
for  the  HMM  model,  that  is,  algorithms  that  respond  to  memory  requests  from  programs  to 
move  stored  items  (instructions  and  data)  up  and  down  the  memory  hierarchy.  We  examine 
offline  and  online  memory  management  algorithms.  An  offline  algorithm  is  one  that  has 
complete  knowledge  of  the  future.  Online  algorithms  cannot  predict  the  future  and  must  act 
only  on  the  data  received  up  to  the  present  time. 

We  use  competitive  analysis,  a  type  of  analysis  not  appearing  elsewhere  in  this  book,  to 
show  that  the  two  widely  used  online  page-replacement  algorithms,  least  recently  used  (LRU) 
and first-in, first-out (FIFO), use about twice as many I/O operations as does MIN, the optimal
offline page-replacement algorithm, when these two algorithms are allowed to use about
twice  as  much  memory  as  MIN.  Competitive  analysis  bounds  the  performance  of  an  online 
algorithm  in  terms  of  that  of  the  optimum  offline  algorithm  for  the  problem  without  knowing 
the  performance  of  the  optimum  algorithm. 

Virtual  memory-management  systems  allow  the  programmer  to  program  for  one  large 
virtual  random-access  memory,  such  as  that  assumed  by  the  HMM,  although  in  reality  the 
memory  contains  multiple  physical  memory  units  one  of  which  is  a  fast  random-access  unit 
accessed by the CPU. In such systems the hardware and operating system cooperate to move
data from secondary storage units to the primary storage unit in pages (a collection of items).
Each reference to a virtual memory location is checked to determine whether or not the referenced
item is in primary memory. If so, the virtual address is converted to a physical one and
the  item  fetched  by  the  CPU.  If  not  (if  a  page  fault  occurs),  the  page  containing  the  virtual 
address  is  moved  into  primary  memory  and  the  tables  used  to  translate  virtual  addresses  are 
updated.  The  item  at  the  virtual  address  is  then  fetched.  To  make  room  for  the  newly  fetched 
page,  one  page  in  the  fast  memory  is  moved  up  the  memory  hierarchy. 

A  page-replacement  algorithm  is  an  algorithm  that  decides  which  page  to  remove  from  a 
full  primary  memory  to  make  space  for  a  new  page.  We  describe  and  analyze  page-replacement 
algorithms  for  two-level  memory  hierarchies  both  because  they  are  important  in  their  own  right 
and  because  they  are  used  as  building  blocks  for  multi-level  page-replacement  algorithms.  A 
two-level  hierarchy  has  primary  and  secondary  memories.  Let  the  primary  memory  contain  n 
pages  and  let  the  secondary  memory  be  of  unlimited  size. 

The FIFO (first-in, first-out) page-replacement algorithm is widely used because it is simple
to implement. Under this replacement policy, the page replaced is the first page to have
arrived  in  primary  memory.  The  LRU  (least  recently  used)  replacement  algorithm  requires 
keeping  for  each  page  the  time  it  was  last  accessed  and  then  choosing  for  replacement  the  page 
with  the  earliest  time,  an  operation  that  is  more  expensive  to  implement  than  the  FIFO  shift 
register. 

Under  the  optimal  two-level  page-replacement  algorithm,  called  MIN,  primary  memory 
is initialized with the first n pages to be accessed. MIN replaces the page $p_i$ in primary memory
whose time $t_i$ of next access is largest. If some other page, $p_j$, were replaced instead of $p_i$, $p_j$
would have to return to the primary memory before $p_i$ is next accessed, and one more page
replacement  would  occur  than  is  required  by  MIN. 

Implementing MIN requires knowledge of the future, a completely unreasonable assumption
on the part of the operating system designer. Nonetheless, MIN is very useful as a standard
against  which  to  compare  the  performance  of  other  page-replacement  algorithms  such  as  FIFO 
and  LRU. 
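
The three policies can be compared on a concrete page-reference sequence with the sketch below. It is an illustration under stated assumptions: the function names are hypothetical, primary memory starts empty (so the first references count as faults, unlike the pre-loaded memory the text assumes for MIN), and the sequence is given directly as page numbers rather than HMM addresses.

```python
from collections import OrderedDict, deque


def fifo_faults(seq, n):
    """Count page faults when the page resident longest is evicted."""
    memory, arrival_order, faults = set(), deque(), 0
    for page in seq:
        if page in memory:
            continue
        faults += 1
        if len(memory) == n:
            memory.discard(arrival_order.popleft())
        memory.add(page)
        arrival_order.append(page)
    return faults


def lru_faults(seq, n):
    """Count page faults when the least recently used page is evicted."""
    memory, faults = OrderedDict(), 0
    for page in seq:
        if page in memory:
            memory.move_to_end(page)  # mark as most recently used
            continue
        faults += 1
        if len(memory) == n:
            memory.popitem(last=False)  # drop the least recently used page
        memory[page] = True
    return faults


def min_faults(seq, n):
    """Count page faults when the page whose next access is latest is evicted."""
    seq = list(seq)
    memory, faults = set(), 0
    for i, page in enumerate(seq):
        if page in memory:
            continue
        faults += 1
        if len(memory) == n:
            def next_use(p):
                # Position of the next reference to p, or infinity if none.
                try:
                    return seq.index(p, i + 1)
                except ValueError:
                    return float("inf")
            memory.discard(max(memory, key=next_use))
        memory.add(page)
    return faults


if __name__ == "__main__":
    refs = [7, 0, 1, 2, 0, 3, 0, 4, 2, 3, 0, 3, 2, 1, 2, 0, 1, 7, 0, 1]
    print(fifo_faults(refs, 3), lru_faults(refs, 3), min_faults(refs, 3))
```

Note that `min_faults` needs the whole sequence in advance, which is exactly the knowledge of the future that makes MIN an offline algorithm.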

11.10.1  Two-Level  Memory-Management  Algorithms 

To compare the performance of FIFO, LRU, and MIN, we characterize memory use by a
memory-address sequence $s = \{s_1, s_2, \ldots\}$ of HMM addresses accessed by a computation.
We assume that no memory entries are created or destroyed. We let $F_{\mathrm{FIFO}}(n, s)$, $F_{\mathrm{LRU}}(n, s)$,
and $F_{\mathrm{MIN}}(n, s)$ be the number of page faults with each page-replacement algorithm on the
memory address sequence s when the primary memory holds n pages.

We  now  bound  the  performance  of  the  FIFO  and  LRU  page-replacement  algorithms  in 
terms  of  that  of  MIN.  We  show  that  if  the  number  of  pages  available  to  FIFO  and  LRU 
is  double  the  number  available  to  MIN,  the  number  of  page  faults  with  FIFO  and  LRU  is 
at  most  about  double  the  number  with  MIN.  It  follows  that  FIFO  and  LRU  are  very  good 
page-replacement  algorithms,  a  result  seen  in  practice. 

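THEOREM 11.10.1 Let $n_{\mathrm{FIFO}}$, $n_{\mathrm{LRU}}$, and $n_{\mathrm{MIN}}$ be the number of primary memory pages used
by the FIFO, LRU, and MIN algorithms. Let $n_{\mathrm{FIFO}} \ge n_{\mathrm{MIN}}$ and $n_{\mathrm{LRU}} \ge n_{\mathrm{MIN}}$. Then, for
any memory-address sequence s the following inequalities hold:

$$F_{\mathrm{FIFO}}(n_{\mathrm{FIFO}}, s) \le \frac{n_{\mathrm{FIFO}}}{n_{\mathrm{FIFO}} - n_{\mathrm{MIN}} + 1}\, F_{\mathrm{MIN}}(n_{\mathrm{MIN}}, s) + n_{\mathrm{MIN}}$$

$$F_{\mathrm{LRU}}(n_{\mathrm{LRU}}, s) \le \frac{n_{\mathrm{LRU}}}{n_{\mathrm{LRU}} - n_{\mathrm{MIN}} + 1}\, F_{\mathrm{MIN}}(n_{\mathrm{MIN}}, s) + n_{\mathrm{MIN}}$$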
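
As a worked instance of the theorem, consider the doubling case mentioned earlier, in which FIFO is given twice as many pages as MIN. The substitution below is only an illustration of the stated bound.

```latex
\begin{align*}
n_{\mathrm{FIFO}} = 2\,n_{\mathrm{MIN}}
  \;\Longrightarrow\;
\frac{n_{\mathrm{FIFO}}}{n_{\mathrm{FIFO}} - n_{\mathrm{MIN}} + 1}
  &= \frac{2\,n_{\mathrm{MIN}}}{n_{\mathrm{MIN}} + 1} \;<\; 2, \\
F_{\mathrm{FIFO}}(2\,n_{\mathrm{MIN}}, s)
  &< 2\,F_{\mathrm{MIN}}(n_{\mathrm{MIN}}, s) + n_{\mathrm{MIN}}.
\end{align*}
```

The same substitution applies to LRU, which is the sense in which doubling the memory at most about doubles the number of page faults.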

Proof We establish the result for FIFO, leaving it to the reader to show it for LRU. (See
Problem 11.23.) Consider a contiguous subsequence t of s that immediately follows a page
fault under FIFO and during which FIFO makes $\phi_{\mathrm{FIFO}} = f \le n_{\mathrm{FIFO}}$ page faults. In the
next paragraph we show that at least f different pages are accessed by FIFO during t. Let
MIN make $\phi_{\mathrm{MIN}}$ faults during t. Because MIN has $n_{\mathrm{MIN}}$ pages, $\phi_{\mathrm{MIN}} \ge f - n_{\mathrm{MIN}} + 1 >
0$. Thus, the ratio of page faults by FIFO and MIN is $f/\phi_{\mathrm{MIN}} \le f/(f - n_{\mathrm{MIN}} + 1)$.

Let $p_i$ be the page on which the fault occurs just before the start of t. To show that at
least f different pages are accessed by FIFO during t, consider the following cases: a) FIFO
faults on $p_i$ in t; b) FIFO faults on some other page at least twice in t; and c) neither case
applies. In the first case, FIFO accesses at least $n_{\mathrm{FIFO}}$ different pages because if it accessed
fewer, then $p_i$ would still be in its primary memory the second time it is accessed. In the
second case, the same statement applies to the page accessed multiple times. In the third
case, FIFO can have only f faults if it accesses at least f different pages during t.

Now subdivide the memory access sequence s into subsequences $t_0, t_1, \ldots, t_k$ such that
$t_i$, $i \ge 1$, starts immediately after a page fault under FIFO and contains $n_{\mathrm{FIFO}}$ faults and
$t_0$ contains at most $n_{\mathrm{FIFO}}$ page faults. This set of subsequences can be found by scanning s
backwards. Since MIN makes $\phi^{(j)}_{\mathrm{MIN}} \ge n_{\mathrm{FIFO}} - n_{\mathrm{MIN}} + 1$ faults on the jth interval, $j \ge 1$,
and $\phi^{(0)}_{\mathrm{MIN}} \ge \phi^{(0)}_{\mathrm{FIFO}} - n_{\mathrm{MIN}}$ faults on the zeroth interval (that is, $\phi^{(0)}_{\mathrm{FIFO}} \le \phi^{(0)}_{\mathrm{MIN}} + n_{\mathrm{MIN}}$),
the number of faults by FIFO, $F_{\mathrm{FIFO}}(n_{\mathrm{FIFO}}, s) = \phi^{(0)}_{\mathrm{FIFO}} + \phi^{(1)}_{\mathrm{FIFO}} + \cdots + \phi^{(k)}_{\mathrm{FIFO}}$, satisfies
the condition of the theorem because $\phi^{(j)}_{\mathrm{FIFO}} \le n_{\mathrm{FIFO}}\, \phi^{(j)}_{\mathrm{MIN}}/(n_{\mathrm{FIFO}} - n_{\mathrm{MIN}} + 1)$ for
$j \ge 1$. ■

The upper bounds are almost best possible because, as stated in Problem 11.24, for any
online algorithm A there is a memory-access sequence s such that the number of page faults
$F_A(n_A, s)$ satisfies the following lower bound:

$$F_A(n_A, s) \ge \frac{n_A}{n_A - n_{\mathrm{MIN}} + 1}\, F_{\mathrm{MIN}}(n_{\mathrm{MIN}}, s)$$

The difference between this lower bound and the upper bounds given for FIFO and LRU
is $n_{\mathrm{MIN}}$, which takes into account the possibility that the initial entries in the primary
memory of MIN and FIFO can be completely different.

It  follows  that  the  FIFO  and  LRU  page-replacement  strategies  are  very  effective  strategies 
for  two-level  memory  hierarchies. 


Problems 

MATHEMATICAL  PRELIMINARIES 

11.1 Let a and b be integers satisfying $1 \le a \le b$. Show that $b/2 \le a\lfloor b/a \rfloor \le b$.

Hint: Consider values of b in the range $ka \le b < (k+1)a$ for k an integer.

11.2 Derive a good lower bound on $\sum_{k=1}^{m} (1/k)$ of the form $\Omega(\log m)$ using an approach
similar to that of Problem 2.2.

PEBBLING MODELS

11.3  Show  that  the  graph  of  Fig.  11.2  can  be  completely  pebbled  in  the  three-level  MHG 
with  resource  vector  p  =  (2,  4)  using  only  four  third-level  pebbles. 

11.4  Consider  pebbling  a  graph  with  the  red-blue  game.  Suppose  that  each  I/O  operation 
uses  twice  as  much  time  as  a  computation  step.  Show  by  example  that  a  red-blue 
pebbling  minimizing  the  total  time  to  pebble  a  graph  does  not  always  minimize  the 
number  of  I/O  operations. 

I/O  TIME  RELATIONSHIPS 

11.5 Let $S_{\min}$ be the minimum number of pebbles needed to pebble the graph G = (V, E)
in the red pebble game. Show that if in the MHG a pebbling strategy $\mathcal{P}$ uses $s_k$ pebbles
at level k or less and $s_k \ge S_{\min} + k - 1$, then no I/O operations at level k + 1 or
higher are necessary except on input and output vertices of G.

11.6  The  rules  of  the  red-blue  pebble  game  suggest  that  inputs  should  be  prefetched  from 
high-level  memory  units  early  enough  that  they  arrive  when  needed.  Devise  a  schedule 
for  delivering  inputs  so  that  the  number  of  I/O  operations  for  matrix  multiplication  is 
minimized  in  the  red-blue  pebble  game. 

THE  HONG-KUNG  LOWER-BOUND  METHOD 

11.7 Derive an expression for the $S$-span $\rho(S, G)$ of the binary tree $G$ shown in Fig. 11.4.

11.8 Consider the pyramid graph $G$ on $n$ inputs shown in Fig. 11.18. Determine its $S$-span
$\rho(S, G)$ as a function of $S$.

11.9  In  Problem  2.3  it  is  shown  that  every  binary  tree  with  k  leaves  has  k  —  1  internal  vertices. 
Show  that  if  t  binary  trees  have  a  total  of  p  pebbles,  at  most  p  —  1  pebbling  steps  are 
possible  on  these  trees  from  an  arbitrary  initial  placement  without  re-pebbling  inputs. 
Hint:  The  vertices  that  can  be  pebbled  from  an  initial  placement  of  pebbles  form  a  set 
of  binary  trees. 

11.10  An  I/O  operation  is  simple  if  after  a  pebble  is  placed  on  a  vertex  the  pebble  currently 
residing  on  that  vertex  is  removed.  Show  that  at  most  twice  as  many  I/O  operations  are 
used  at  each  level  by  the  MHG  when  every  I/O  operation  is  simple. 


Figure  11.18  The  pyramid  graph. 
Hint: Compare pebble placement with and without the requirement that placements
be  simple,  arguing  that  if  a  pebble  removed  by  a  simple  I/O  operation  is  needed  later  it 
can  be  obtained  by  one  simple  I/O  operation  for  each  of  the  original  I/O  operations. 

TRADEOFFS  IN  THE  MEMORY  HIERARCHIES 

11.11  Using  the  results  of  Problem  11.8,  derive  good  upper  and  lower  bounds  on  the  I/O 
time  to  pebble  the  pyramid  graph  of  Fig.  11.18  in  terms  of  n. 

11.12 Under the conditions of Problem 11.4, show that any pebbling of a DAG for convolution
of $n$-sequences with the minimal pebbling strategy when $S \ge S_{\min}$ and $n$ is large
has much larger total cost than a strategy that treats blue pebbles as red pebbles.

BLOCK  I/O  IN  THE  MHG 

11.13 Determine how efficiently matrix-vector multiplication can be done in the block-I/O
model described in Section 11.6.

11.14 Show that matrix-matrix multiplication can be done efficiently in the block-I/O model
described in Section 11.6.

SIMULATING  FAST  MEMORIES 

11.15  Determine  conditions  on  a  memory  hierarchy  under  which  the  FFT  can  be  executed 
efficiently  in  the  standard  MHG.  Discuss  the  extent  to  which  these  conditions  are  likely 
to  be  met  in  practice. 

11.16  Repeat  the  previous  problem  for  convolution  realized  by  the  algorithm  stated  in  the 
convolution  theorem. 

11.17  The  definition  of  a  minimal  pebbling  stated  in  Section  11.2  assumes  that  it  is  much 
more  expensive  to  perform  a  high-level  I/O  operation  than  a  low-level  one.  Determine 
the  extent  to  which  the  lower  bound  of  Theorem  11.4.1  depends  on  this  assumption. 
Apply your insight to the problem of matrix multiplication of $n \times n$ matrices in the
three-level MHG in which $s_1 < 3n^2$ and $s_2 > 3n^2$. (See Theorem 11.5.3.) Determine
whether  increasing  the  number  of  level-3  I/O  operations  affects  the  number  of  level-2 
I/O  operations. 

THE  BLOCK-TRANSFER  MODEL 

11.18  Derive  a  lower  bound  on  the  time  to  realize  a  permutation  network  on  n  inputs  in  the 
block-transfer  model. 

Hint: Count the number of orderings possible between the $n$ inputs. Base your argument
on the number of orderings within blocks and between elements in the primary
memory,  and  the  number  of  ways  of  choosing  which  block  from  the  secondary  memory 
to  move  into  the  primary  memory. 

11.19  Derive  a  lower  bound  on  the  time  to  realize  the  FFT  graph  on  n  inputs  in  the  block- 
transfer  model. 

Hint: Use the result of Section 7.8.2 to argue that an $n$-point FFT graph cannot have
many  fewer  vertices  than  there  are  switches  in  a  permutation  network. 
THE HIERARCHICAL MEMORY MODEL

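11.20 Derive the following lower bounds on the cost of computing the following functions
when the cost function is $\nu(a) = a^{\alpha}$:

Matrix multiplication:

$$\mathcal{K}_\nu\bigl(f^{(n)}_{A \times B}\bigr) = \begin{cases} \Omega(n^{2\alpha+2}) & \text{if } \alpha > 1/2 \\ \Omega(n^3 \log n) & \text{if } \alpha = 1/2 \\ \Omega(n^3) & \text{if } \alpha < 1/2 \end{cases}$$

Fast Fourier transform: $\mathcal{K}_\nu\bigl(F^{(n)}\bigr) = \Omega(n^{\alpha+1})$

Binary search: $\mathcal{K}_\nu\bigl(f^{(n)}_{BS}\bigr) = \Omega(n^{\alpha})$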

Hint: Use the following identity to recast expressions for the computation time:

$$\sum_{k=1}^{n} \Delta g(k)\, h(k) = -\sum_{k=1}^{n-1} \Delta h(k)\, g(k+1) + g(n+1)h(n) - g(1)h(1)$$
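
The identity in the hint can be spot-checked numerically with a short sketch like the one below. It assumes that $\Delta$ denotes the forward difference, $\Delta g(k) = g(k+1) - g(k)$, which this excerpt does not define; the function names and the sample functions are illustrative only.

```python
def delta(f, k):
    """Forward difference (assumed meaning of Delta): f(k + 1) - f(k)."""
    return f(k + 1) - f(k)


def lhs(g, h, n):
    """Left side: sum over k = 1..n of Delta g(k) * h(k)."""
    return sum(delta(g, k) * h(k) for k in range(1, n + 1))


def rhs(g, h, n):
    """Right side: -sum over k = 1..n-1 of Delta h(k) * g(k+1), plus boundary terms."""
    return (-sum(delta(h, k) * g(k + 1) for k in range(1, n))
            + g(n + 1) * h(n) - g(1) * h(1))


if __name__ == "__main__":
    g = lambda k: k ** 3       # illustrative integer-valued functions
    h = lambda k: 2 * k + 1
    print(all(lhs(g, h, n) == rhs(g, h, n) for n in range(1, 21)))
```

Integer-valued test functions are used so that the comparison is exact rather than subject to floating-point rounding.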

11.21 A cost function $\nu(a)$ is polynomially bounded if for some $K > 1$ and all $a \ge 1$,
$\nu(2a) \le K\nu(a)$. Let the cost function $\nu(a)$ be polynomially bounded. Show that
there are positive constants $c$ and $d$ such that $\nu(a) \le c a^{d}$.

11.22 Derive a good upper bound on the cost to sort in the HMM with the logarithmic cost
function $\lceil \log a \rceil$.


COMPETITIVE  MEMORY  MANAGEMENT 


11.23 By analogy with the argument for FIFO in the proof of Theorem 11.10.1, consider any
memory-address sequence $s$ and a contiguous subsequence $t$ of $s$ that immediately
follows a page fault under LRU and during which LRU makes $\phi_{\mathrm{LRU}} = f \le n_{\mathrm{LRU}}$
page faults. Show that at least $f$ different pages are accessed by LRU during $t$.

11.24 Let $A$ be any online page-replacement algorithm that uses $n_A$ pages of primary memory.
Show that there are arbitrarily long memory-address sequences $s$ such that the number
of page faults with $A$, $F_A(s)$, satisfies the following lower bound, where $n_{\mathrm{MIN}}$ is the
number of pages used by the optimal algorithm MIN:

$$F_A(s) \ge \frac{n_A}{n_A - n_{\mathrm{MIN}} + 1}\, F_{\mathrm{MIN}}(s)$$


Hint: Design a memory-address sequence $s$ of length $n_A$ with the property that the
first $n_A - n_{\mathrm{MIN}} + 1$ accesses by $A$ are to pages that are in neither $A$'s nor MIN's primary
memory. Let $S$ be the $n_A + 1$ pages that are either in MIN's primary memory initially
or those accessed by $A$ during the first $n_A - n_{\mathrm{MIN}} + 1$ accesses. Let the next $n_{\mathrm{MIN}} - 1$
page accesses by $A$ be to pages not in $S$.
Chapter Notes

Hong  and  Kung  [137]  introduced  the  first  formal  model  for  the  I/O  complexity  of  problems, 
the  red-blue  pebble  game,  an  extension  of  the  pebble  game  introduced  by  Paterson  and  Hewitt 
[239].  The  analysis  of  Section  11.1.2  is  due  to  Kung  [178].  Hong  and  Kung  derived  lower 
bounds  on  the  number  of  I/O  operations  needed  for  specific  graphs  for  matrix  multiplication 
(Theorem  11.5.2),  the  FFT  (Theorem  11.5.4),  odd-even  transposition  sort  and  a  number  of 
other  problems.  Savage  [295]  generalized  the  red-blue  pebble  game  to  the  memory-hierarchy 
game, simplified the proof of Theorem 11.4.1, and obtained Theorems 11.5.3 and 11.5.5 and
the results of Section 11.3. Lemma 11.5.2 is implicit in the work of Hong and Kung [137];
the simplified proof given here is due to Aggarwal and Vitter [9]. The results of Section 11.5.4
are due to Savage [295].

The two-level contiguous block-transfer model of Section 11.8.1 was introduced by Savage
and Vitter [296] in the context of parallel space-time tradeoffs. The analysis of sorting of
Section 11.8.1 is due to Aggarwal and Vitter [9]. In this paper they also derive similar bounds
on the I/O time to realize the FFT, permutation networks and matrix transposition.

The  hierarchical  memory  model  of  Section  11.9  was  introduced  by  Aggarwal,  Alpern, 
Chandra, and Snir [7]. They studied a number of problems including matrix multiplication,
the FFT, sorting and circuit simulation, and examined logarithmic, linear, and polynomial
cost functions. The two-level bounds of Section 11.10 are due to Sleator and Tarjan [311].
Aggarwal, Alpern, Chandra, and Snir [7] extended this model to multiple levels. The MIN
page-replacement algorithm described in Section 11.10 is due to Belady [35].

Two  other  I/O  models  of  interest  are  the  BT  model  and  the  uniform  memory  hierarchy. 
Aggarwal,  Chandra,  and  Snir  [8]  introduced  the  BT  model,  an  extension  of  the  HMM  model 
supporting block transfers in which a block of size $b$ ending at location $x$ is allowed to move
in time $f(x) + b$. They establish tight bounds on computation time for problems including
matrix transpose, FFT, and sorting using the cost functions $\lceil \log x \rceil$, $x$, and $x^{\alpha}$ for $0 < \alpha < 1$.

Alpern, Carter, and Feig [18] introduced the uniform memory hierarchy in which the
$u$th memory has capacity $\alpha\rho^{2u}$, block size $\rho^{u}$, and time $\rho^{u}/\beta(u)$ to move a block between
levels; $\beta(u)$ is a bandwidth function. They allow I/O overlap between levels and determine
conditions  under  which  matrix  transposition,  matrix  multiplication,  and  Fourier  transforms 
can  and  cannot  be  done  efficiently. 

Vitter and Shriver [354] have examined three parallel memory systems in which the memories
are disks with block transfer, of the HMM type, or of the BT type. They present a
randomized version of distribution sort that meets the lower bounds for these models of computation.
Nodine and Vitter [232] give an optimal deterministic sorting algorithm for these
memory models.
VLSI Models of Computation


The electronics revolution initiated by the invention of the transistor by Shockley, Brattain,
and  Bardeen  in  1947  accelerated  with  the  invention  of  the  integrated  circuit  in  1958  and  1959 
by  Jack  Kilby  and  Robert  Noyce.  An  integrated  circuit  contains  wires,  transistors,  resistors, 
and  other  components  all  integrated  on  the  surface  of  a  chip,  a  piece  of  semiconductor  material 
about  the  size  of  a  thumbnail.  And  the  revolution  continues.  The  number  of  components  that 
can  be  placed  on  a  semiconductor  chip  has  doubled  almost  every  18  months  for  about  40  years. 
Today  more  than  10  million  of  them  can  fit  on  a  single  chip.  Integrated  circuits  with  very  large 
numbers  of  components  exhibit  what  is  known  as  very  large-scale  integration  (VLSI).  This 
chapter  explores  the  new  models  that  arise  as  a  result  of  VLSI. 

As electronic components decreased in size, the area occupied by wires
consumed  an  increasing  fraction  of  chip  area.  In  fact,  today  some  applications  devote  more 
than  half  of  their  area  to  wires.  In  this  chapter  we  examine  VLSI  models  of  computation 
that  take  this  fact  into  account.  Using  simulation  techniques  analogous  to  those  employed  in 
Chapter  3,  we  show  that  the  performance  of  algorithms  on  VLSI  chips  can  be  characterized 
by the product $AT^2$, where $A$ is the chip area and $T$ is the number of steps used by a chip
to compute a function. We relate $AT^2$ to the planar circuit size $C_{p,\Omega}(f)$ of a function $f$, a
measure that plays the role for VLSI chips that circuit size plays for FSMs. The $AT^2$ measure
is the direct analog of the measure $C_\Omega(\delta, \lambda)T$ for the finite-state machine that was introduced
in Chapter 3, where $C_\Omega(\delta, \lambda)$ is the size of a circuit to simulate the next-state and output
functions of the FSM. We also relate the measure $A^2T$ to $C_{p,\Omega}(f)$.
12.1 The VLSI Challenge

The design of VLSI chips represents an enormous intellectual challenge akin to that of constructing
very large programs. Each involves the assembly of millions of elements: instructions
in the case of software, and electronic components in the case of chips. The design and
implementation  of  VLSI  chips  is  also  challenging  because  it  involves  many  steps  and  many 
technologies.  In  this  section  we  provide  a  brief  introduction  to  this  process  as  preparation 
for  the  introduction  of  the  VLSI  models  and  algorithms  that  are  the  principal  topics  of  this 
chapter. 

12.1.1  Chip  Fabrication 

A  VLSI  chip  consists  of  a  number  of  conducting,  insulating,  and  doped  layers  that  are  placed 
on  a  semiconductor  substrate.  (A  doped  layer  is  created  on  the  surface  of  the  substrate  by 
infusing  small  concentrations  of  impurities  into  the  semiconductor.  This  is  called  doping.) 
The layers are created using masks, templates with open regions through which ionizing radiation
is projected onto the surface of the semiconductor. The radiation changes the chemical
properties  of  a  previously  deposited  photosensitive  material  so  that  the  exposed  regions  can 
be  washed  away  with  a  solvent.  The  material  that  is  now  exposed  can  be  doped  or  removed. 
Doping  is  used  to  create  transistors  and  wires.  A  removal  step  is  used  when  a  metallic  layer  has 
been  previously  deposited  from  which  sections  are  to  be  removed,  leaving  wires.  A  chip  may 
have  several  layers  of  wires  separated  by  layers  of  insulating  material  in  addition  to  the  doped 
layers  that  form  transistors  and  wires.  The  layout  of  a  NAND  gate  is  shown  schematically  in 
Fig.  12.1,  in  which  the  shadings  of  rectangles  and  annotations  identify  to  a  chip  designer  the 
types  of  materials  used  to  realize  the  gate. 


Figure 12.1 The schematic layout of a NAND gate and its logical symbol.
Geometric design rules specify the amounts of overlap of and separation between metal and
dopant  rectangles  that  are  needed  to  guarantee  the  desired  electrical  and  electronic  properties  of 
a  VLSI  circuit.  If  wires  are  too  thin,  electrons,  which  move  through  them  at  very  high  speeds, 
can  cause  excess  heating  as  well  as  dislodge  atoms  and  create  an  open  circuit  (this  is  called 
metal  migration),  especially  at  points  at  which  a  wire  bends  to  descend  into  a  well  created 
during  chip  fabrication.  Similarly,  if  wires  are  too  close,  an  error  in  registration  of  masks  may 
cause  short  circuits  between  wires.  Also,  since  transistors  are  constructed  through  the  doping 
and  overlaying  of  insulating  and  conducting  materials,  if  the  regions  defining  a  transistor  are 
too  small,  it  will  not  behave  as  expected. 

The  geometric  design  rules  for  a  particular  chip  technology  can  be  quite  complex.  For  the 
purpose  of  analysis  they  are  simplified  into  a  few  rules  concerning  the  width  and  separation 
of  rectangles,  the  amount  of  area  required  for  contacts  between  wires  on  layers  separated  by 
insulation,  and  the  size  of  the  various  rectangular  regions  that  form  gates  and  transistors.  As 
suggested  by  this  discussion,  a  VLSI  chip  is  quasiplanar;  that  is,  its  components  lie  on  a  few 
layers,  which  are  separated  by  insulation  except  where  contacts  are  made  between  layers. 


12.1.2  Design  and  Layout 

Many  tools  and  techniques  have  been  developed  to  address  the  complexity  of  chip  layout. 
Typically these tools and techniques use abstraction; that is, they decompose a problem into
successively lower-level units of increasing complexity. At each level the number of units involved
in a design is kept small so that the design is comprehensible.

The design of a VLSI chip begins with the specification of its functionality at the functional
or algorithmic level. Either a function or an algorithm is given as the starting point.
An  algorithm  is  then  produced  and  translated  into  a  specification  at  the  architectural  level. 
At  this  level  a  chip  is  specified  in  terms  of  large  units  such  as  a  CPU,  random-access  memory, 
bus,  floating-point  unit,  and  I/O  devices.  (The  material  of  Chapters  3  and  4  is  relevant  at  this 
level.)  After  an  architectural  specification  is  produced,  design  commences  at  the  logical  level. 
Here  particular  methods  for  realizing  architectural  units  are  chosen.  For  example,  an  adder 
could  be  realized  either  as  a  ripple  or  a  carry-lookahead  adder  depending  on  the  stated  speed 
and  cost  objectives.  (The  material  of  Chapter  2  applies  at  this  level.) 

At the gate level, the next level in the design process, a technology, such as NMOS or
CMOS, is chosen in which to realize the transistors and wires. This involves specifications of
widths  for  wires,  the  number  of  layers  of  metal,  and  other  things.  If  new  transistor  layouts  are 
used,  their  physics  is  often  simulated  to  determine  their  electrical  properties. 

At  the  next  level,  the  layout  level,  a  gate-level  design  is  translated  into  physical  positions  for 
modules,  gates,  and  wires.  Often  at  this  level  a  rough  layout  is  produced  manually,  after  which 
automatic  routing  and  compaction  algorithms  are  invoked  to  route  wires  between  modules 
and  squeeze  out  the  unnecessary  area.  Space  must  be  reserved  on  each  layout  for  I/O  pads, 
rectangular  regions  large  enough  to  connect  external  wires.  They  serve  as  ports  through  which 
data  is  read  and  written.  Because  these  wires  and  pads  are  very  large  by  comparison  with  the 
wires  on  the  chip,  there  is  a  practical  limit  on  the  number  of  I/O  ports  on  a  chip.  A  port  can 
be  both  an  input  and  an  output  port. 

Once  a  layout  is  complete  it  is  usually  simulated  logically,  that  is,  at  the  level  of  Boolean 
gates.  Parts  of  it  may  also  be  simulated  electrically,  a  much  more  time-consuming  process  given 
the  much  lower  level  of  detail  that  it  entails. 
After a chip has been fabricated it is then tested. Because the testing process for a complete
chip  cannot  be  exhaustive,  due  to  the  number  of  configurations  that  are  possible,  subunits  are 
often  isolated  and  tested.  Testing  circuitry  is  often  built  into  a  chip  to  simplify  the  testing 
process. 

Because the design, layout, simulation, and testing of VLSI chips are complex and error-prone,
computer-aided design (CAD) tools have been developed. CAD is a very large subject
beyond the scope of this book. Instead, we limit our attention in this chapter to the performance
of VLSI chips.


12.2 VLSI Physical Models

Of  all  the  parameters  that  affect  the  performance  of  a  VLSI  chip,  its  area  is  one  of  the  most 
important.  Equally  important  are  the  width  of  and  separation  between  wires,  both  of  which 
are  directly  related  to  area.  Area  is  important  for  two  reasons.  First,  a  larger  area  means  a  chip 
can  have  more  computing  elements  and  do  more  work.  Also,  more  area  means  a  chip  can  have 
more  I/O  ports  to  facilitate  data  movement  on  and  off  the  chip. 

Unfortunately,  the  area  of  a  chip  has  a  practical  limit  due  to  imperfections  that  occur  in  the 
chip  manufacturing  process.  A  single  very  small  piece  of  dust  or  a  dislocation  in  the  crystalline 
semiconductor  substrate,  each  of  which  can  be  large  by  comparison  with  the  dimensions  of 
components,  can  destroy  a  chip.  As  a  consequence,  only  a  small  fraction  (the  yield)  of  the 
chips  resulting  from  a  fabrication  process  work.  The  rest  must  be  discarded. 

The yield of a chip is very sensitive to its size. If the number of faults per unit area is
$F$, with very high probability a fault occurs if the area $A$ of a chip exceeds $1/F$. As $F$ is
reduced by improvements in the manufacturing process, the area of any one chip can increase.
However, if $F$ is fixed, so is the value of $A$ at which an economical yield is possible. ($F$ has
not decreased much over time.) To make chip manufacture economical, dozens of chips are
manufactured  together  on  a  circular  wafer  of  4  to  8  inches  in  diameter.  The  wafer  is  then  sliced 
into  individual  chips.  If  the  die  size  is  chosen  correctly,  a  fixed  fraction  of  the  chips  on  a  wafer 
will  work.  The  importance  of  testing  becomes  evident  in  light  of  these  observations. 
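
The sensitivity of yield to area can be illustrated with a small sketch. It assumes a Poisson defect model, in which faults fall independently at density $F$ per unit area so that a chip of area $A$ is fault-free with probability $e^{-FA}$. This model is an assumption introduced here for illustration; the text states only the qualitative threshold near $A = 1/F$. The function name is hypothetical.

```python
import math


def poisson_yield(fault_density, area):
    """Probability that a chip has no faults under an assumed Poisson defect model."""
    return math.exp(-fault_density * area)


if __name__ == "__main__":
    F = 0.5  # illustrative fault density (faults per unit area)
    for A in (0.5, 1.0, 2.0, 4.0, 8.0):
        # Yield drops quickly once A grows past 1/F = 2.0
        print(A, round(poisson_yield(F, A), 4))
```

Under this assumed model the yield at $A = 1/F$ is $e^{-1}$, and it falls off exponentially as the area grows beyond that point.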

Because  the  area  of  a  chip  has  a  practical  upper  limit,  the  width  and  separation  of  wires 
determine  the  number  of  components  that  can  be  placed  on  a  chip.  As  mentioned  above,  the 
technology  for  chip  manufacture  places  a  lower  limit  on  these  parameters  as  well  as  the  area  of 
chip  components. 

To simplify our modeling and analysis, we assume that the minimal width and separation
of wires is $\lambda$ (the minimum feature size) and that each gate, memory cell, port, and pair
of crossing wires has area $\lambda^2$. There is no great loss in assuming a single number for wire
width and separation and one number for the minimal area of components because in practice
the width and separation of wires of different kinds and the area of components are all small
multiples of common values. The only component for which these assumptions are weak is
the pads for I/O ports, which are generally very much larger than $\lambda^2$. It is important to be
cognizant of this fact in drawing conclusions.

Since chips are quasiplanar, we assume that each chip has at most $\nu \ge 1$ layers on which
wires  can  reside  but  that  there  is  only  one  layer  of  gates.  Also,  since  wires  are  rectangular,  it 
is  impractical  for  them  to  meet  at  angles  that  are  not  close  to  0  or  45  degrees.  In  fact,  wires 
are  usually  rectilinear,  that  is,  run  horizontally  and  vertically.  Thus,  we  assume  that  wires  are 
rectilinear. 
To complete the physical modeling of chips we recognize three types of transmission
model: the synchronous, transmission-line, and diffusion models. The synchronous model
assumes  that  one  unit  of  time  is  needed  to  transmit  a  bit  across  a  wire,  independent  of  its 
length.  This  is  a  good  model  when  the  switching  time  of  gates  is  large  by  comparison  with 
the  time  to  transmit  data  through  a  wire  or  when  wires  are  short,  a  situation  that  prevails  for 
most  designs.  When  it  does  not  prevail,  the  unit  of  transmission  time  can  be  increased  so  that 
it  does  apply.  The  transmission-line  model  assumes  that  the  time  to  transmit  a  bit  across  a 
wire  is  proportional  to  its  length  (see  Problems  12.1  and  12.2),  whereas  the  diffusion  model 
assumes  it  is  quadratic  in  its  length.  The  models  apply  to  VLSI  chip  technologies  at  different 
wire  lengths.  The  synchronous,  transmission-line,  and  diffusion  models  apply  to  wires  that  are 
short,  medium-length,  and  long,  respectively. 
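
The three transmission models differ only in how delay scales with wire length, which the following minimal sketch makes explicit. The function name, the string labels, and the `unit` scaling constant are illustrative assumptions; the text specifies only the constant, linear, and quadratic dependence on length.

```python
def wire_delay(length, model, unit=1.0):
    """Time to transmit one bit across a wire of the given length.

    `unit` is a hypothetical technology-dependent constant.
    """
    if model == "synchronous":
        return unit                 # independent of length
    if model == "transmission-line":
        return unit * length        # proportional to length
    if model == "diffusion":
        return unit * length ** 2   # quadratic in length
    raise ValueError(f"unknown transmission model: {model}")


if __name__ == "__main__":
    for model in ("synchronous", "transmission-line", "diffusion"):
        print(model, [wire_delay(length, model) for length in (1, 2, 4, 8)])
```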

Although  we  do  not  examine  energy  consumption  in  this  chapter,  the  type  of  gate  used 
can  have  a  large  impact  on  the  amount  of  energy  consumed  during  a  computation.  NMOS 
transistors  consume  energy  all  the  time,  whereas  CMOS  transistors  consume  energy  only  when 
they  change  their  state. 

When the areas of I/O pads and gates are comparable, the placement of the pads on a VLSI
chip can have a big impact on the area occupied by a chip. For example, if the chip realizes a
tree and its $n$ leaves (and their pads) are placed on the boundary of a convex region, as noted
in Problem 12.3, the chip must have area proportional to $n \log n$. However, as shown in
Section  12.5.1,  when  its  leaves  can  be  placed  anywhere,  there  is  a  layout  for  a  tree  (known  as 
the  H-tree)  that  has  area  proportional  to  n.  If  the  I/O  pads  are  much  larger  than  the  gates,  the 
impact  of  their  placement  is  diminished. 

12.3  VLSI  Computational  Models 

We assume that a VLSI chip implements a finite-state machine instantiated as a clocked sequential
machine. (A chip could also model an analog computer rather than a digital one, a
topic not discussed in this book.) Although every FSM is eventually realized from two-input
gates, binary memory cells, and wires carrying binary values (see Section 3.1), chips are generally
designed around an aggregate model for data. That is, if operations are done on integers,
the wires associated with an integer travel together on the chip surface. Although the time required
for an operation on data depends on the size of the alphabet from which the data is drawn
and on the complexity of the operation itself, we simplify the analysis by assuming that one
unit of time is taken. A more sophisticated analysis takes these factors into account.

To be concrete we let the states of an FSM be represented as tuples over a set $X$ of binary
$b$-tuples. We also assume that gates realize functions $\{h : X^2 \mapsto X\}$ and that memory cells
hold one value of $X$. We recognize a logic circuit over the set $X$ as the graph of a straight-line
program in which the operations are drawn from a basis $\{h : X^2 \mapsto X\}$. This model is used to study
problems  defined  over  non-binary  alphabets,  such  as  matrix  multiplication  and  the  discrete 
Fourier  transform  over  rings. 

We continue to use the notation $\lambda$ for the minimum feature size of a VLSI chip even
though  we  now  allow  data  to  be  treated  as  values  in  the  set  X.  When  the  set  X  is  big,  it  will 
be  important  to  make  use  of  its  size  in  accounting  for  the  area  occupied  by  wires  and  gates,  an 
issue  that  we  ignore  in  this  chapter. 

Computation  time  in  the  synchronous  model  is  the  number  of  steps  executed  by  a  chip. 
This  is  the  same  measure  of  time  used  for  finite-state  machines.  Computation  time  in  the 
other models is the elapsed time in seconds, which is approximated by the number of steps
multiplied  by  the  length  of  the  longest  step.  This  time  is  generally  a  function  of  the  area  of  the 
chip  and  the  problem  for  which  the  chip  is  designed. 

Another  measure  of  time,  but  one  that  is  given  only  a  cursory  examination,  is  the  period 
P  of  a  VLSI  chip.  This  is  the  time  between  successive  inputs  to  a  pipelined  chip,  one  designed 
to  receive  a  new  set  of  inputs  while  the  previous  inputs  are  propagating  through  it.  Pipelining 
is illustrated in Section 12.5.1 on H-trees and Section 11.6 on block I/O.

In this chapter we assume that VLSI chips compute a single function $f : X^n \mapsto X^m$,
a  perfectly  general  assumption  that  allows  any  FSM  computation  to  be  performed.  While 
this  allows  the  VLSI  chip  to  be  a  CPU  or  a  RAM,  to  convey  ideas  we  limit  our  attention 
to  functions  that  are  simply  defined,  such  as  matrix  multiplication  and  the  discrete  Fourier 
transform. 

The  variables  of  the  function  computed  by  a  VLSI  chip  are  supplied  via  its  I/O  ports.  A 
single  port  can  receive  the  values  of  multiple  variables  but  at  different  time  instances.  Also, 
the  value  of  a  variable  can  be  supplied  at  multiple  ports,  either  in  the  same  time  step  or  in 
multiple  time  steps.  However,  the  outputs  of  a  function  computed  by  a  chip  are  supplied  once 
to  an  output  port.  As  noted  above,  a  port  can  be  either  an  input  or  output  port  or  serve  both 
purposes,  but  not  in  the  same  time  step. 

As  with  the  FSM,  we  cannot  allow  either  the  time  or  the  I/O  port  at  which  data  is  received 
as  input  or  is  supplied  as  output  to  be  data-dependent.  To  do  otherwise  is  to  assume  that  an 
external  agent  not  included  in  the  model  is  performing  computations  on  behalf  of  the  user. 
We  can  expect  misleading  results  if  this  is  allowed.  Thus,  we  assume  that  each  I/O  operation 
is  where-  and  when-oblivious;  that  is,  where  an  input  or  output  occurs  is  data-independent, 
as  are  the  times  at  which  the  I/O  operations  occur. 

For  many  VLSI  computations  it  is  important  that  the  input  data  be  read  once  by  the 
chip  even  if  it  may  be  convenient  to  read  it  multiple  times.  (These  are  called  semellective  or 
read-once  computations.)  For  example,  if  a  chip  is  connected  to  a  common  bus  it  may  be 
desirable  to  supply  the  data  on  which  the  chip  operates  once  rather  than  add  hardware  to  the 
chip  to  allow  it  to  request  external  data.  However,  in  other  situations  it  may  be  desirable  to 
provide  data  to  a  chip  multiple  times.  Such  computations  are  called  multilective.  Multilective 
computations  must  be  where-  and  when-oblivious. 

If a multilective VLSI algorithm reads its $n$ input variables $\beta\mu n$ times but only $\mu n$ times
when multiple inputs of a variable (at multiple time steps) at one I/O port are treated as a
single input, then the algorithm is $(\beta, \mu)$-multilective.
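
The parameters of this definition can be computed from an input schedule, as in the following sketch. The representation of a schedule as a list of `(variable, port, time)` triples and the function name are assumptions made for illustration. The total number of reads is taken as $\beta\mu n$, and the number of distinct (variable, port) pairs as $\mu n$.

```python
from fractions import Fraction


def multilectivity(reads, n):
    """Return (beta, mu) for a schedule of (variable, port, time) reads of n variables."""
    total = len(reads)                                        # beta * mu * n
    merged = len({(var, port) for var, port, _ in reads})     # mu * n
    mu = Fraction(merged, n)
    beta = Fraction(total, merged)
    return beta, mu


if __name__ == "__main__":
    # Illustrative schedule: 2 variables, each supplied at 2 ports, twice per port.
    reads = [(v, p, 2 * v + r) for v in range(2) for p in range(2) for r in range(2)]
    print(multilectivity(reads, 2))
```

A semellective (read-once) schedule, in which each variable is read exactly once, gives $\beta = \mu = 1$ under this accounting.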


12.4  VLSI  Performance  Criteria 

As stated in Theorem 7.4.1, the product $pT_p$ of the time, $T_p$, and the number of processors, $p$,
in a parallel network of RAM processors to solve a problem cannot be less than the serial time,
$T_s$, on a serial RAM with the same total storage capacity for that problem. Applying this result
to the VLSI model, since the number of processors of any given size that can be placed on a
chip of area $A$ is proportional to $A$, it follows that the product $AT$ of area with the time $T$
for a chip to complete a task cannot be less than the serial time to compute the same function
using a single processor; that is, $AT = \Omega(T_s)$.

In the next section we show that the matrix-vector multiplication and prefix functions can
be realized optimally with respect to the AT measure. This holds because these problems have
low complexity. For problems of higher complexity, such as $n \times n$ matrix-matrix multiplication,
we cannot achieve AT-optimality because stronger lower bounds apply. In particular, both
$AT^2$ and $A^2T$ must grow as $n^4$ for this problem, as we show. AT, $AT^2$, and $A^2T$ are the only
measures of VLSI performance considered in this chapter.

12.5  Chip  Layout 

In  this  section  we  describe  and  discuss  layouts  for  a  number  of  important  graphs  and  problems. 
These  include  balanced  binary  trees,  multi-dimensional  meshes,  and  the  cube-connected  cycle. 

12.5.1 The H-Tree Layout

H-trees are embeddings of binary trees that use area efficiently. Let $H_k$ be an H-tree with $4^k$
leaves. Figure 12.2 shows the H-tree $H_2$ with 16 darkly shaded squares that can be viewed
either as subtrees or leaves. The lightly shaded regions are internal vertices of the binary tree.
Leaves often perform special functions that are not performed by internal vertices, whereas
internal vertices of a tree often perform the same function. Each quadrant of the tree shown in
Fig. 12.2 can be viewed as the H-tree $H_1$ on four subtrees or leaves.

The layout of $H_k$ is recursively defined as follows: replace each of the $4^{k-1}$ leaves of $H_{k-1}$
with a copy of $H_1$. Thus, $H_2$ in Fig. 12.2 is obtained by replacing each leaf in $H_1$ with a copy
of $H_1$.

We now derive an upper bound on the area of an H-tree under the assumption that each
vertex is square, leaf vertices occupy area $b^2$, and the separation between leaf vertices is c. If
$S(k)$ is the length of a side of $H_k$, then $S(1) = 2b + c$. Also, from the recursive construction
of $H_k$ the following recurrence holds:

$$S(k) = 2S(k-1) + c$$


Figure 12.2 The H-tree $H_2$ containing 16 subtrees (or leaves).

The solution to this recurrence is $S(k) = (b + c)2^k - c$, as the reader can verify. Since
$H_k$ has $n = 4^k$ leaves and area $A_n = (S(k))^2$, it follows that an n-vertex H-tree has area
$A_n \le n(b + c)^2$.
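The recurrence and its closed form can be checked numerically. The following is a minimal sketch; the function names are illustrative and not part of the text, and b and c are taken to be integers so that the comparison is exact.

```python
def h_tree_side(k, b, c):
    """Side length S(k) of H_k from S(1) = 2b + c and S(k) = 2 S(k-1) + c."""
    if k < 1:
        raise ValueError("k must be at least 1")
    side = 2 * b + c
    for _ in range(k - 1):
        side = 2 * side + c
    return side


def h_tree_side_closed_form(k, b, c):
    """Closed-form solution S(k) = (b + c) 2^k - c of the recurrence."""
    return (b + c) * 2 ** k - c


def h_tree_area(k, b, c):
    """Area of the square layout of H_k, which has n = 4^k leaves."""
    return h_tree_side(k, b, c) ** 2
```

For every k the area computed this way stays within the bound n(b + c)^2 with n = 4^k, which is the linear-area property that makes the H-tree attractive.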

To  appreciate  the  importance  of  the  H-tree  construction,  observe  that  its  leaves  are  interior 
to  the  layout.  Given  the  usual  drawing  of  a  binary  tree  one  is  tempted  to  place  its  leaves  along 
the  boundary  of  a  chip.  If  this  boundary  is  convex,  the  area  of  a  binary  tree  on  n  leaves  must 
be  at  least  proportional  to  n  log  n.  (See  Problem  12.3.) 


MATRIX-VECTOR MULTIPLICATION ON AN H-TREE We now describe an algorithm based on an
H-tree that multiplies an $n \times n$ matrix A with an n-vector x, $n = 2^k$, by forming the n inner
products of the n rows of A with x. (Matrix-vector multiplication is defined in Section 6.2.2.)
This algorithm assumes that one unit of time is taken to store one piece of data and to perform
an addition or multiplication on data.

On the first time step of our algorithm the components of the vector x are supplied in
parallel to the n leaves of the tree and stored there. On the second time step the components of
the first row of A are also provided in parallel to the leaves. In the third time step
corresponding components of x and the first row of A are multiplied. In $k = \log_2 n$
additional time steps these products are added in the H-tree and the result supplied as output.
In the next two steps the second row of A is supplied as input and its components multiplied
by those of x. After k additional steps these products are summed and the result generated
as output. This process is repeated for each of the remaining rows of A. This algorithm is
semellective.

Since we treat the time to add and multiply as the basis for measuring the time required
by this H-tree, each inner product requires $O(\log n)$ time and the n inner products require
$O(n \log n)$ time. However, if each addition vertex in this tree can also store its result (thereby
causing a slight increase in area), a new row of A can be supplied to the H-tree in each unit
of time (we say the period of the computation is P = 1) because a series of partial results
can move through the tree in parallel. This is an example of pipelining. In this case the time
to perform the n inner products is $O(n + \log n) = O(n)$. If pipelining is not used, this
matrix-vector multiplication algorithm does not make the best use of area and time, as we now
show.
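The effect of pipelining on the step count can be illustrated with a small simulation. This is a sketch under simplifying assumptions that are not in the text: loading a row and forming its products are merged into one step, and the tree is modeled as a list of register levels rather than as a layout. The function name `pipelined_matvec` is hypothetical.

```python
def pipelined_matvec(A, x):
    """Simulate a pipelined adder tree computing A x on n = 2**k leaves.

    Returns (results, steps). Each step loads one new row (if any), and
    every partial result already in the tree moves up exactly one level.
    """
    n = len(x)
    k = n.bit_length() - 1
    if n != 1 << k:
        raise ValueError("n must be a power of two")
    stages = [None] * (k + 1)   # stages[0] holds leaf products, stages[k] the root
    rows = iter(A)
    results, steps = [], 0
    while len(results) < len(A):
        # The value that reached the root on the previous step is output now.
        if stages[k] is not None:
            results.append(stages[k][0])
        # Update from the root downward so each value advances one level per step.
        for level in range(k, 0, -1):
            below = stages[level - 1]
            stages[level] = None if below is None else [
                below[2 * i] + below[2 * i + 1] for i in range(len(below) // 2)
            ]
        row = next(rows, None)
        stages[0] = None if row is None else [a * b for a, b in zip(row, x)]
        steps += 1
    return results, steps
```

In this model the n inner products emerge after n + k + 1 steps, in line with the O(n + log n) estimate, because a new row enters the tree on every step while earlier rows are still being summed.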

Even without pipelining there exists an AT-optimal algorithm for matrix-vector multiplication.
Let n be such that $n/\log_2 n$ is a power of 4. Decompose each row of A as well as x
into $(\log_2 n)$-tuples. This is equivalent to representing the $n \times n$ matrix A by an $n \times (n/\log_2 n)$
matrix B whose entries are $1 \times \log_2 n$ matrices (equivalently, $(\log_2 n)$-vectors) and to
representing x by an $(n/\log_2 n)$-vector y whose components are $(\log_2 n)$-vectors.

We implement this computation on an H-tree with $O(n/\log n)$ area. To compute the
inner product of the jth row of A with x, sequentially supply to each H-tree leaf the components
of one $(\log_2 n)$-vector of y and the corresponding vector in the jth row of B. Supply the
individual components of these $(\log_2 n)$-vectors in alternate cycles. After a leaf vertex receives
the corresponding components of A and x, it multiplies them and adds the result to its running
sum. Upon completion of an inner product of two $(\log_2 n)$-vectors, the leaf vertices make their
values available to be added in the H-tree in $O(\log n)$ steps. After n of these operations, all n
inner products of Ax are computed.

This algorithm uses $T = O(n \log n)$ time but only has area $A = O(n/\log n)$. Thus,
its area-time product satisfies $AT = O(n^2)$, which is optimal since each of the $n^2 + n$
components of A and x must be read. This algorithm is multilective because it supplies each
component of x n times.
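The blocked algorithm can be sketched in software as follows. This is only an illustration of the data flow, not a model of the chip: the name `blocked_matvec`, the `block` parameter (standing in for $\log_2 n$), and the step-counting convention (two cycles per component at the leaves, one per tree level) are assumptions made for the example.

```python
def blocked_matvec(A, x, block):
    """Compute A x with n/block leaves, each accumulating `block` products.

    Returns (results, steps). The number of leaves must be a power of two
    so that the leaf sums can be added pairwise in a binary tree.
    """
    n = len(x)
    if n % block != 0:
        raise ValueError("block must divide n")
    leaves = n // block
    if leaves & (leaves - 1) != 0:
        raise ValueError("number of leaves must be a power of two")
    results, steps = [], 0
    for row in A:
        sums = [0] * leaves
        for t in range(block):
            # Every leaf receives its t-th component of x and of the row.
            for leaf in range(leaves):
                idx = leaf * block + t
                sums[leaf] += row[idx] * x[idx]
            steps += 2          # components arrive in alternate cycles
        level = sums
        while len(level) > 1:   # add the leaf sums in the tree
            level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
            steps += 1
        results.append(level[0])
    return results, steps
```

With n = 16 and block = 4 there are four leaves, and each row costs 2·4 leaf cycles plus 2 tree levels, so the count grows as n log n while the number of leaves is only n / log n.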


PREFIX COMPUTATION ON AN H-TREE The H-tree is also an effective way to do a prefix
computation. Prefix computations (let $\odot$ be the associative operator) are naturally executed on
trees. A tree-based prefix computation is described in Problem 7.31. One datum enters the
root of the tree; the rest travel up from the leaves. When implemented on an H-tree, this
algorithm uses area $O(n)$ on n inputs and time $O(\log n)$, giving an AT product of $O(n \log n)$.
This algorithm is semellective.

This algorithm can be converted into an AT-optimal algorithm using a technique similar
to that used above. We subdivide the input n-tuple x into $(\log_2 n)$-tuples, of which there are
$n/\log_2 n$, and serially form the associative combination of the $\log_2 n$ components of each
tuple using $\odot$ in $\log_2 n$ steps. We then perform the prefix computation on these $n/\log_2 n$
results. To complete the computation, for $1 \le j \le (n/\log_2 n) - 1$ we reread each of the
original $(\log_2 n)$-tuples in parallel and add the $(j-1)$st result (the zeroth result is 0) to the
first component of the jth $(\log_2 n)$-tuple, and then serially perform a prefix computation on
these new $(\log_2 n)$-tuples.

We increase $n/\log_2 n$ to the next power of 4 (adding inputs whose corresponding
outputs are ignored) and embed the tree of Fig. 7.23 directly into an H-tree. The initial associative
combination of $(\log_2 n)$-tuples and the final prefix computation on $(\log_2 n)$-tuples are done
at vertices of the H-tree that are I/O vertices of the prefix tree. This algorithm takes time
$O(\log n)$ on the initial and final phases as well as on the prefix computation. Since the area of
the layout is $O(n/\log_2 n)$ and every one of the n inputs must be read, its area-time product,
AT, is $O(n)$, which is optimal. This algorithm is multilective since each input is supplied
twice.
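The three phases of the blocked prefix computation can be sketched as follows. The function name `blocked_prefix` and the `block` parameter (standing in for $\log_2 n$) are illustrative; the middle phase is done here with a serial scan, whereas on the chip it would be carried out by the prefix tree. The first block is left unchanged in the last phase, which avoids having to name an identity element for the operator.

```python
from functools import reduce
from itertools import accumulate
import operator


def blocked_prefix(xs, block, op=operator.add):
    """Prefix computation on xs using blocks of `block` consecutive inputs."""
    blocks = [xs[i:i + block] for i in range(0, len(xs), block)]
    # Phase 1: serially combine the components of each block.
    totals = [reduce(op, blk) for blk in blocks]
    # Phase 2: prefix computation on the block totals (the tree's job on a chip).
    block_prefix = list(accumulate(totals, op))
    # Phase 3: reread each block, fold the preceding prefix into its first
    # component, then do a serial prefix computation inside the block.
    out = []
    for j, blk in enumerate(blocks):
        blk = list(blk)
        if j > 0:
            blk[0] = op(block_prefix[j - 1], blk[0])
        out.extend(accumulate(blk, op))
    return out
```

Because the preceding prefix is always applied on the left, the sketch works for any associative operator, including non-commutative ones such as string concatenation. Each input is read in phase 1 and again in phase 3, which is why the algorithm is multilective.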


12.5.2  Multi-dimensional  Mesh  Layouts 

As  explained  in  Section  7.5,  many  important  problems  can  be  solved  with  systolic  arrays.  If 
the  cells  of  one-  and  two-dimensional  systolic  arrays  are  of  fixed  size  and  quasiplanar,  they  can 
be  embedded  directly  onto  a  chip  with  area  proportional  to  the  number  of  cells.  Applying  the 
results  of  Theorems  7.5.1,  7.5.2,  and  7.5.3  we  have  the  following  facts  concerning  the  area  and 
time  for  three  important  problems  when  realized  by  such  arrays. 


Problem                                                   Dimensions   Area     Time
$n \times n$ Matrix-Vector Multiplication                 1D           $O(n)$   $O(n)$
Bubble Sort of n items                                    1D           $O(n)$   $O(n)$
Batcher’s Odd-Even Sorting of n items                     1D           $O(n)$   $O(n)$
$\sqrt{n} \times \sqrt{n}$ Matrix-Matrix Multiplication   2D           $O(n)$   $O(\sqrt{n})$

Fully normal algorithms for problems such as shifting, summing, broadcasting, and fast
Fourier transform on $n = 2^{2d}$ inputs can each be done in $O(\log n)$ steps on the n-vertex
hypercube or the canonical cube-connected cycles network on n vertices. From Theorems 7.7.4
and 7.7.5 these problems can also be solved in $O(n)$ and $O(\sqrt{n})$ steps, respectively, on
n-vertex one- and two-dimensional systolic arrays. We summarize these facts in Figure 12.3.

Problem                       Dimensions   Area     Time
Shifting of n-vector          1D           $O(n)$   $O(n)$
                              2D           $O(n)$   $O(\sqrt{n})$
Summing n items               1D           $O(n)$   $O(n)$
                              2D           $O(n)$   $O(\sqrt{n})$
Broadcasting to n locations   1D           $O(n)$   $O(n)$
                              2D           $O(n)$   $O(\sqrt{n})$
n-point FFT                   1D           $O(n)$   $O(n)$
                              2D           $O(n)$   $O(\sqrt{n})$

Figure 12.3 Area vs. time performance of VLSI algorithms for four problems.


In Section 12.6 we show that shifting of an n-vector, the n-point FFT, and $n \times n$
matrix-matrix multiplication each require area A and time T satisfying $AT^2 = \Omega(n^2)$. Consequently,
the 2D algorithms cited above for these problems are optimal to within a constant factor.

In the next section we show that every normal algorithm can be implemented on
the cube-connected cycles (CCC) network in time T satisfying $\Omega(\log n) \le T \le O(\sqrt{n})$
and that the CCC network can be embedded in the plane using area $A = O(n^2/T^2)$. In
Theorems 12.7.2 and 12.7.3 we show that these implementations are optimal up to constant
multiplicative factors with respect to area and time for the three problems mentioned above.

12.5.3  Layout  of  the  CCC  Network 

In Section 7.7.6 we describe the realization of a fully normal algorithm on the canonical CCC
network. The realization extends directly from the canonical CCC network to a general (k, d)-CCC
network in which there are $2^d$ cycles and $2^k$ vertices on each cycle. (See Fig. 12.4.)

A fully normal algorithm is simulated on the CCC network by giving the processors on
the jth cycle, $0 \le j \le 2^d - 1$, the addresses $i + j2^k$ where $0 \le i \le 2^k - 1$. The cycles
are treated as 1D arrays and used to simulate a normal algorithm on the first k dimensions
exactly  as  is  done  in  Section  7.7.6.  These  simulations  are  done  in  parallel  after  which  the 
swaps  across  the  higher-order  d  dimensions  are  simulated  by  first  rotating  the  leading  element 
on  each  cycle  to  the  first  of  the  inter-cycle  edges.  After  executing  one  swap,  each  cycle  is 
advanced  one  step  so  that  the  second  elements  on  each  cycle  are  aligned  with  the  first  of  the 
high-order  dimensions.  At  this  point  the  first  elements  on  each  cycle  are  aligned  with  the  edge 
associated  with  the  second  of  the  high-order  dimensions.  Thus,  while  swaps  are  done  between 
the  second  elements  on  each  cycle  across  the  first  of  the  high-order  dimensions,  swaps  occur 
between  leading  elements  along  the  second  of  the  high-order  dimensions.  This  rotating  and 
swapping  is  done  until  all  cycle  elements  have  been  swapped  across  all  high-order  dimensions. 
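The addressing scheme above determines which hypercube swaps stay inside a cycle and which use inter-cycle edges. A minimal sketch, with hypothetical helper names, assuming the address of position i on cycle j is $i + j2^k$ as stated:

```python
def ccc_location(address, k):
    """Split a hypercube address i + j * 2**k into (cycle j, position i)."""
    return address >> k, address & ((1 << k) - 1)


def swap_partner(address, dim):
    """Hypercube neighbor of `address` across dimension `dim`."""
    return address ^ (1 << dim)


def swap_kind(address, dim, k):
    """Report whether a swap across `dim` stays on one cycle or crosses cycles."""
    cycle, _ = ccc_location(address, k)
    partner_cycle, _ = ccc_location(swap_partner(address, dim), k)
    return "within cycle" if cycle == partner_cycle else "between cycles"
```

Swaps across the k low-order dimensions pair two positions on the same cycle, so they are handled by the 1D-array simulation. Swaps across the d high-order dimensions pair the same position on two different cycles, which is why the cycles must be rotated to bring each element to the appropriate inter-cycle edge.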

This algorithm performs $O(2^k)$ steps on the cycles to perform swaps across low-order
dimensions and align the cycles for swaps at higher dimensions. An additional $O(d)$ steps are
used to perform swaps on the d high-order dimensions. Thus, the number of steps used by
this algorithm, T, satisfies $T = O(2^k + d)$. The number of processors used in a (k, d)-CCC
network, n, satisfies $n = 2^{d+k}$.


Figure 12.4 An embedding of a (k, d)-CCC network in the plane for k = 3 and d = 4. The
$2^d$ columns represent cycles of length $2^k \ge d$. For $1 \le j \le d$, the jth vertex on each cycle is
connected to the jth vertex on another cycle.


Figure 12.4 shows a layout of a (3, 4)-CCC network. A layout for a general (k, d)-CCC
network, $2^k \ge d$, can be developed following this pattern. Place each cycle of length $2^k$ in
a column. Use $2^d - 1$ rows to make connections between columns. These rows are divided
into d sets. The first set, consisting of one row, connects adjacent columns. The second
set, containing two rows, connects every other column. The jth set, containing $2^{j-1}$ rows,
connects every $2^{j-1}$th column. The number of rows used for these connections is
$1 + 2 + 4 + \cdots + 2^{d-1} = 2^d - 1$. Since d processors are used in each column to make these connections,
each column contains $2^k - d \ge 0$ processors not connected to other columns. (These are
suggested by the lightly shaded vertices.) It follows that this layout has $2^d + 2^k - (d + 1)$ rows
and $2^{d+1}$ columns. If a wire is assumed to have the same width as a processor, the layout has
area $A = 2^{d+1}(2^d + 2^k - (d + 1))$.
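The row and column counts of this layout are easy to tabulate. The following sketch (the function name `ccc_layout` is illustrative) adds up the rows set by set, as in the description above, and measures area in units of one processor width squared.

```python
def ccc_layout(k, d):
    """Return (rows, columns, area) of the (k, d)-CCC layout described above."""
    if 2 ** k < d:
        raise ValueError("the layout assumes 2**k >= d")
    # The j-th set of connection rows contains 2**(j-1) rows, for j = 1..d.
    connection_rows = sum(2 ** (j - 1) for j in range(1, d + 1))
    # Processors on a cycle that are not attached to inter-cycle edges.
    free_rows = 2 ** k - d
    rows = connection_rows + free_rows
    columns = 2 ** (d + 1)
    return rows, columns, rows * columns
```

For the (3, 4)-CCC network of Figure 12.4 this gives 19 rows and 32 columns, in agreement with the formulas $2^d + 2^k - (d + 1)$ and $2^{d+1}$.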

Recall that $n = 2^{d+k}$ and $2^k \ge d$, or $k \ge \log_2 d$. It follows that $T = O(2^k + d) = O(2^k)$.
Since $k \ge \log_2 d$, $T = \Omega(d) = \Omega(\log n)$. Also, when $k \le d$, $2^{2k} \le n$ and $T \le O(\sqrt{n})$. We
summarize this result below.

THEOREM 12.5.1 Every fully normal algorithm for an n-processor hypercube can be implemented
on a CCC network whose VLSI layout has area A and uses time T satisfying the following bound
for $\Omega(\log n) \le T = O(\sqrt{n})$.


$$AT^2 = O(n^2)$$

This result can be applied to any of the fully normal algorithms described in Section 7.6
and the Beneš permutation network discussed in Section 7.8.2.
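The bound of Theorem 12.5.1 follows by combining the layout area with the running time. The derivation below is a sketch in which the constant $c$ is introduced only for illustration; it stands for the constant hidden in $T = O(2^k)$, and $k \le d$ is the case in which $T = O(\sqrt{n})$.

```latex
\begin{align*}
A   &= 2^{d+1}\bigl(2^d + 2^k - (d+1)\bigr) \;\le\; 2^{d+1}\bigl(2^d + 2^k\bigr),
       \qquad T \le c\,2^k, \\
AT^2 &\le c^2\, 2^{d+1}\bigl(2^d + 2^k\bigr)\, 2^{2k}
       \;=\; 2c^2\bigl(2^{2(d+k)} + 2^{d+3k}\bigr) \\
     &\le 4c^2\, 2^{2(d+k)} \;=\; 4c^2 n^2
       \qquad \text{since } k \le d \text{ implies } 2^{d+3k} \le 2^{2d+2k}.
\end{align*}
```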


12.6  Area-Time  Tradeoffs 

The $AT^2$ measure encountered in the last section is fundamental to VLSI computation. This
is established by deriving a lower bound on $AT^2$ in terms of the planar circuit complexity,
$C_{p,\Omega}(f)$, of the function f computed by a VLSI chip of area A in T steps. A similar result is
derived for the product $A^2T$. The planar circuit size of f is the size of the smallest memoryless
planar circuit for f. The measures $AT^2$ and $A^2T$ are, up to constant factors, upper bounds on
the sizes of two different memoryless planar circuits that compute the same mapping from
inputs to outputs as a VLSI chip of area A that executes T steps.

12.6.1  Planar  Circuit  Size 

We  now  formally  define  planar  circuit  size  and  show  how  it  relates  to  the  standard  circuit  size 
measure. 

DEFINITION 12.6.1 A planar circuit over the set $\mathcal{X}$ is a logic circuit over the set $\mathcal{X}$ that has been
embedded in the plane in such a way that gates do not overlap but edges may cross. A planar circuit
is semellective if there is a unique vertex at which each input variable is supplied. Otherwise, the
planar circuit is multilective.

The size of a planar circuit is the number of inputs, edge crossings, and gates drawn from
a basis $\Omega = \{h : \mathcal{X}^2 \mapsto \mathcal{X}\}$ that the circuit contains. The planar circuit size of a function
$f : \mathcal{X}^n \mapsto \mathcal{X}^m$ over $\Omega$, $C_{p,\Omega}(f)$, is the size of the smallest planar circuit for f over the basis $\Omega$.

A multilective circuit of order $\mu$, $\mu \ge 1$, for a function $f : \mathcal{B}^n \mapsto \mathcal{B}^m$ has $\mu n$ input vertices.
The size of the smallest multilective planar circuit of order $\mu$ for f is denoted $C^{(\mu)}_{p,\Omega}(f)$. If the planar
circuit is semellective, the planar circuit size of f is denoted $C^{(1)}_{p,\Omega}(f)$ or $C_{p,\Omega}(f)$ when confusion
is not likely.

Every  binary  function  has  a  planar  circuit.  To  see  this,  observe  that  every  function  has  a 
circuit,  which  is  a  graph,  and  that  every  graph  has  a  planar  embedding  with  edge  crossings. 
The  planar  circuit  size  of  a  function  is  at  worst  quadratic  in  its  standard  circuit  size,  as  we  now 
show. 

LEMMA 12.6.1 The (multilective) planar circuit size and standard circuit size of $f : \mathcal{B}^n \mapsto \mathcal{B}^m$ relative to
the basis $\Omega$ are in the following relationship where r is the fan-in of $\Omega$.

$$C_\Omega(f) + n \le C_{p,\Omega}(f) \le r^2 C_\Omega^2(f)/2 + C_\Omega(f) + n$$

Proof  The  first  inequality  follows  because  the  planar  circuit  size  measure  includes  inputs, 
crossings,  and  gates,  whereas  the  circuit  size  measure  includes  only  gates. 

Consider an embedding of a standard circuit for f containing $C_\Omega(f)$ gates. In such
an embedding it is not necessary for any two edges to intersect more than once because if
they violate this condition the edge segments between any two successive crossings can be
swapped so that these two crossings can be eliminated. Since every gate has at most r inputs,
a minimal standard circuit for f has at most $rC_\Omega(f)$ edges connecting gates. It follows that
the number of crossings does not exceed $r^2 C_\Omega(f)^2/2$ because there are at most $\binom{q}{2}$ ways of
forming pairs drawn from a set of size q and $q = rC_\Omega(f)$. Combining this with the number
of inputs and gates, we have the desired upper bound. ■


Figure 12.5 Two simulations of a T-step VLSI chip computation by a planar circuit.
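The counting step in the proof of Lemma 12.6.1 can be written out explicitly. With $q = rC_\Omega(f)$ edges, each pair of edges crossing at most once, the number of crossings is bounded as follows; adding the gates and the n inputs gives the upper bound of the lemma.

```latex
\begin{align*}
\#\text{crossings} \;\le\; \binom{q}{2} \;=\; \frac{q(q-1)}{2} \;\le\; \frac{q^2}{2}
  \;=\; \frac{r^2 C_\Omega(f)^2}{2},
\qquad
C_{p,\Omega}(f) \;\le\; \frac{r^2 C_\Omega(f)^2}{2} + C_\Omega(f) + n .
\end{align*}
```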

In Section 12.7 we exhibit a function that nearly meets the upper bound of Lemma 12.6.1. That
is, the planar circuit size of this function is nearly quadratic in its standard circuit size.

12.6.2  Computational  Inequalities 

We now show that every VLSI chip computation can be simulated by planar circuits of size
$O(AT^2)$ and $O(A^2T)$. The simulation is patterned on the simulations of Chapter 3; that is,
the loop that constitutes the computation by the chip with memory is unwound to create a
planar circuit. Instead of passing the outputs of the next-state/output circuit to binary memory
cells, they are passed to another copy of the circuit.

Figure  12.5  shows  two  simulations  of  a  T-step  VLSI  chip  computation  by  a  planar  circuit. 
The  first  is  obtained  by  placing  T  copies  of  the  chip  one  above  the  other  and  supplying  the 
state  output  of  one  copy  to  the  state  input  of  the  next  copy.  The  second  is  simulated  by  placing 
T  copies  of  the  chip  side  by  side  and  running  wires  from  the  state  output  of  one  chip  to  the 
state  input  of  the  next.  We  convert  each  of  these  memoryless  circuits  to  planar  circuits  and 
bound  the  number  of  inputs,  crossings  and  gates  they  contain.  Recall  that  we  assume  that 
wires  are  rectilinear;  that  is,  they  run  only  horizontally  and  vertically. 

Since  the  number  of  wire  layers  on  a  single  chip  is  bounded,  it  does  not  hurt  to  assume 
that  the  centerlines  of  parallel  wires  on  different  planes  are  displaced  slightly.  (It  is  bad  practice 
to  overlap  wires  because  one  wire  can  induce  currents  in  the  other.)  Now  make  the  width  of 
wires  and  the  area  of  gates  infinitesimal.  (Wires  are  shrunk  to  their  centerline.)  As  shown  in 
Fig. 12.6(a), each two-input gate is replaced by an infinitesimal vertex connected by a straight
line to its output, and the two connections from its inputs are made by wires that contain bends
(two  wires  touch).  This  converts  a  single  chip  to  a  planar  graph  with  wires  that  touch  or  cross. 
(See  Fig.  12.6(b)  and  (c)). 

We now bound $n_w$, the number of wires, and $n_g$, the number of gates on a chip of area
A. Since each wire has width $\lambda$ and length at least $\lambda$ and each gate occupies area $\lambda^2$, $n_w$ and
$n_g$ satisfy the following bounds.

$$n_w \le A/\lambda^2$$
$$n_g \le A/\lambda^2$$

Figure 12.6 (a) The result of shrinking a physical gate to a point, (b) a crossing of two wires,
and (c) four types of connection between two wires.
Because each point of crossing or touching of wires occupies area at least $\lambda^2$, the number
of points at which wires cross and touch on each of the $\nu$ layers of a chip that has area A is
at most $A/\lambda^2$. As shown in Fig. 12.6(a), when gates are made infinitesimal two additional
bends are created at the point at which the output wire touches the gate. This can be viewed
as adding four wire bends per gate. Since the number of gates is at most $A/\lambda^2$, we have the
following bound on $n_{cr}$, the number of wire crossings and touchings.

$$n_{cr} \le (\nu + 4)A/\lambda^2$$

Consider the first of the two simulations. T layers of one chip are placed one above the
other. To expose overlapping wires, displace all layers to the northeast by an infinitesimal
amount. Every pair of wires that cross or meet has the potential to introduce crossings, as
suggested in Fig. 12.7(a) and (b). The maximum number of crossings that can be introduced
per touching or crossing of wires is $T^2$. Since the number of input vertices is $O(AT)$, this
provides an upper bound of $O(AT^2)$ on the number of inputs, gates, and crossings of the
resultant planar circuit.

Now consider the second simulation. T copies of one chip are laid side by side, the
layout of each chip is opened, and at most $n_w$ parallel wires are inserted to make connections to
adjacent chips. Since there are $n_w$ wire segments on a single chip, at most $n_w^2$ new wire
crossings are introduced on one chip. Thus, the number of inputs, gates, and crossings in this
layout is $O(AT + n_w^2 T) = O(A^2T)$.

The following theorem, which is an application of Theorem 3.1.1 to the VLSI model,
summarizes the above results. It makes use of the fact that the planar circuit size of a function
f computed by a VLSI chip of the kind described above is no larger than that of the planar
circuits just constructed. This theorem demonstrates the importance of the measures $AT^2$ and
$A^2T$ as characterizations of the complexity of VLSI computations. It also shows that lower
bounds on the performance of VLSI chips can be obtained in terms of the planar circuit size
of the functions computed by them.

Figure 12.7 Crossings obtained by translating infinitesimally to the northeast T copies of (a)
one crossing and (b) the four possible connections between two wires.


THEOREM 12.6.1 Let $f_M^{(T)}$ be the function computed by a VLSI chip that realizes the FSM M
in T steps. The planar circuit size over a basis $\Omega = \{h : \mathcal{X}^2 \mapsto \mathcal{X}\}$ of any function f computed
by M in T steps satisfies the following inequalities:

$$C_{p,\Omega}(f) = O(AT^2)$$

$$C_{p,\Omega}(f) = O(A^2T)$$

If M is multilective of order $\mu$, then $C_{p,\Omega}(f)$ is replaced by $C^{(\mu)}_{p,\Omega}(f)$.

It is important to note that these relationships between planar circuit size and the
measures $AT^2$ and $A^2T$ hold for all functions computed by VLSI algorithms, both multi-output
functions and predicates.

In the next section we develop the planar separator theorem, which is then used
to derive lower bounds on the planar circuit size of important problems.

12.6.3  The  Planar  Separator  Theorem 

The planar separator theorem applies to graphs G = (V, E) for which a non-negative cost
function c is defined on V. The cost of V, denoted c(V), is the sum of the costs of the
vertices in V. The theorem states that the vertices of every planar graph G on N vertices can be
partitioned into three sets, A, B, and C, such that no edge connects a vertex in A with one in
B, the costs of the vertices in A, c(A), and of those in B, c(B), satisfy $c(A), c(B) \le 2c(V)/3$, and
C contains at most $4\sqrt{N}$ vertices.
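The three conditions in this statement can be checked mechanically for a proposed partition. The sketch below is an illustration only: it verifies a given separator rather than constructing one, and the helper names and the grid example are assumptions chosen for the demonstration (a square grid is planar, and its middle column is a natural separator).

```python
import math


def grid_graph(side):
    """Vertices and edges of a side x side grid, which is a planar graph."""
    vertices = [(r, c) for r in range(side) for c in range(side)]
    edges = []
    for r, c in vertices:
        if r + 1 < side:
            edges.append(((r, c), (r + 1, c)))
        if c + 1 < side:
            edges.append(((r, c), (r, c + 1)))
    return vertices, edges


def is_separator(vertices, edges, cost, A, B, C):
    """Check the conditions of the planar separator theorem for (A, B, C)."""
    total = sum(cost[v] for v in vertices)
    is_partition = (A | B | C) == set(vertices) and not (A & B or A & C or B & C)
    no_crossing_edge = all(
        not ((u in A and v in B) or (u in B and v in A)) for u, v in edges
    )
    balanced = (sum(cost[v] for v in A) <= 2 * total / 3
                and sum(cost[v] for v in B) <= 2 * total / 3)
    small = len(C) <= 4 * math.sqrt(len(vertices))
    return is_partition and no_crossing_edge and balanced and small
```

On a 5 x 5 grid with unit costs, taking C to be the middle column and A and B to be the columns on either side satisfies all three conditions: C has 5 vertices, which is well below $4\sqrt{25} = 20$, and each side has cost 10, which is at most 50/3.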

The  following  lemma  uses  the  concept  of  the  spanning  tree  of  a  graph,  a  tree  that  contains 
every  vertex  of  a  connected  graph  G.  It  shows  the  existence  of  a  cycle  that  divides  a  planar  graph 
into  an  “inside”  and  an  “outside”  containing  about  the  same  number  of  vertices.  The  radius 
of  a  rooted  spanning  tree  is  the  number  of  edges  on  the  longest  path  from  the  root  to  a  vertex. 
(See Problem 12.8 for an illustration of the following lemma.)

LEMMA 12.6.2 Let G = (V, E) be a finite connected planar graph. Let c be a non-negative
cost function defined on V and let c(V) be the total cost of all vertices in V. If G has a rooted
spanning tree of radius r, then V can be partitioned into sets A, B, and C such that $c(A), c(B) \le 2c(V)/3$,
no edge joins a vertex of A with one of B, and C contains at most 2r + 1 vertices.

Proof Since the lemma is true if the cost of any vertex exceeds c(V)/3, assume otherwise. Let
G = (V, E) be embedded in the plane. A face of a planar graph is a region bounded by
vertices and edges that does not contain any other vertices and edges. The external face of a
finite planar graph is the face of unbounded area. Since G is finite, it has an external face. A
triangular planar graph is a planar graph in which each face is a triangle. If a planar graph
is not triangular, it can be made triangular by choosing one vertex on the boundary of each
face and adding an edge between it and every other vertex on this face to which it does not
already have an edge. Without loss of generality we assume that G is triangular.

Let T be the spanning tree of radius r postulated in the lemma. Each edge e in E not
on T defines a unique cycle $\xi(e)$ of length at most 2r + 1. The cycle divides V into three
sets: the vertices on $\xi(e)$ and the vertices on each side of $\xi(e)$. Let $c_1(e)$ and $c_2(e)$ be the costs of
the vertices on either side. (The side with the larger cost is called the inside of the cycle.) We
claim that for some e not on T the larger of $c_1(e)$ and $c_2(e)$ is at most 2c(V)/3. We
suppose the larger is more than 2c(V)/3 for every such e and establish a contradiction.

Let e = (x, y) be an edge not on T such that $\mu(e) = \max(c_1(e), c_2(e))$ is smallest and,
among all edges $e^*$ such that $\mu(e^*) = \mu(e)$, the inside of $\xi(e)$ has the fewest faces. In case of
ties, let e be chosen arbitrarily. We show the assumption that $\mu(e) > 2c(V)/3$ is false.

Consider the triangle containing the edge e = (x, y) on the side of the cycle $\xi(e)$ that
has largest cost. Let z be the third vertex in this triangle. The vertex z is on the spanning tree because
every vertex is on the tree. We consider two cases for z: (a) either edge (x, z) or (y, z) is in
T and (b) neither edge is in T.

In case (a) without loss of generality, let (y, z) be in T. There are two subcases to
consider: (a1) z is on $\xi(e)$ (see Fig. 12.8(a)) and (a2) it is not on $\xi(e)$ (see Fig. 12.8(b)). In
(a1) the edge e' = (x, z) cannot be a tree edge since T contains no cycles unless the cycle
consists of just the vertices x, y, and z, which is impossible since the inside of $\xi(e)$ contains
at least one vertex. But $\xi(e')$ includes the same set of vertices of V inside it (and has the
same cost) as does $\xi(e)$, although it has fewer faces, contradicting the choice for e = (x, y).

Figure 12.8 A non-tree edge e = (x, y) in a triangular planar graph with spanning tree T
defines a cycle $\xi(e)$. The triangle containing e on the larger side of $\xi(e)$ contains a third vertex
z. In (a) and (b) (y, z) is on T, whereas in (c) neither (x, z) nor (y, z) is on T. In (a) (y, z) is
on $\xi(e)$, whereas in (b) it is not.

In case (a2) the edge e' = (x, z) is a non-tree edge since T contains no cycles. The inside
of $\xi(e')$ contains no more cost and one less face than $\xi(e)$. If the cost inside $\xi(e')$ is greater
than the cost outside, e' would have been chosen instead of e. On the other hand, if the
cost inside $\xi(e')$ is at most the cost outside, since the latter is equal to the cost outside $\xi(e)$,
which is at most c(V)/3, the cost inside $\xi(e')$ is at most c(V)/3. However, this contradicts
the assumption that $\mu(e^*) > 2c(V)/3$ for all edges $e^*$.

Consider the case (b) in which neither edge (x, z) nor (y, z) is in T. (See Fig. 12.8(c).) The edges (x, z) and (y, z) each define a cycle contained within £(e). Without loss of generality assume that the cycle defined by (x, z) has more cost on the inside of £(e) than does the cycle defined by (y, z). Because the cost of vertices on the inside of the original cycle is more than 2c(V)/3, the cost inside and on £((x, z)) is more than c(V)/3. Thus, the cost outside £((x, z)) is less than or equal to 2c(V)/3. If the cost inside £((x, z)) is also less than or equal to 2c(V)/3, we have a contradiction. If greater than 2c(V)/3, £((x, z)) is a cycle with fewer faces for which γ((x, z)) > 2c(V)/3, another contradiction. ■

The  following  theorem  uses  Lemma  12.6.2  together  with  a  spanning  tree  constructed 
through  a  breadth-first  traversal  of  a  connected  planar  graph  to  show  the  existence  of  a  small 
separator  that  divides  the  vertices  into  approximately  two  equal  cost  parts. 

THEOREM 12.6.2 Let $G = (V, E)$ be an $N$-vertex planar graph having non-negative vertex costs summing to $c(V)$. Then, $V$ can be partitioned into three sets, $A$, $B$, and $C$, such that no edge joins vertices in $A$ with those in $B$, neither $A$ nor $B$ has cost exceeding $2c(V)/3$, and $C$ contains no more than $4\sqrt{N}$ vertices.

Proof  We  assume  that  G  is  connected.  If  not,  embed  it  in  the  plane  and  add  edges  as 
appropriate  to  make  it  connected.  Assume  that  it  has  been  triangulated,  that  is,  every  face 
except  for  the  outermost  is  a  triangle. 

Pick any vertex (call it the root) and perform a breadth-first traversal of $G$. This traversal defines a BFS spanning tree $T$ of $G$. A vertex $v$ has level $d$ in this tree if the path from the root to $v$ has $d$ edges. There are no vertices at level $q$, where $q$ is one larger than the largest level of any vertex. Let $R_d$ be the set of vertices at level $d$ and let $r_d = |R_d|$.

The reader is asked to show that there is some level $m$ such that the cost of vertices at levels below and above $m$ each is at most $c(V)/2$. (See Problem 12.9.) Let $l$ and $h$, $l \le m \le h$, be levels closest to $m$ that contain at most $\sqrt{N}$ vertices. That is, $r_l \le \sqrt{N}$ and $r_h \le \sqrt{N}$. There are such levels because level 0 contains a single vertex and there are none at level $q$.

The vertices in $G$ are partitioned into the following five sets: a) $L = \bigcup_{d < l} R_d$, b) $R_l$, c) $M = \bigcup_{l < d < h} R_d$, d) $R_h$, and e) $H = \bigcup_{h < d} R_d$. Since $L$ and $H$ are subsets of the sets of vertices with levels less than and more than $m$, $c(L), c(H) \le c(V)/2$. Also, by construction, $r_l, r_h \le \sqrt{N}$. If $R_l = R_h = R_m$ (which implies that $M$ is empty and $l = h = m$), let $A = L$, $B = H$, and $C = R_l = R_h$. Then, $C$ is a separator of size at most $\sqrt{N}$ and the theorem holds. If $l \ne h$, then $h - l - 1 \ge 0$. Since each of the $h - l - 1$ levels between $l$ and $h$ has at least $\sqrt{N} + 1$ vertices, it follows that $h - l - 1 \le \sqrt{N} - 1$ because these levels cannot have more than $N - 1$ vertices altogether.
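
The level-selection step can be sketched in code. The Python below is an illustration only and is not part of the original proof: the names `bfs_levels` and `choose_levels` and the adjacency-dictionary representation of the graph are assumptions made here. It computes the BFS levels, finds a level m with cost at most c(V)/2 strictly below and strictly above it, and then picks the nearest levels l ≤ m ≤ h holding at most √N vertices, treating the empty level q as a valid choice for h.

```python
from collections import deque


def bfs_levels(adj, root):
    """Return [R_0, R_1, ..., R_q] for a connected graph; R_q is the empty level q."""
    level = {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in level:
                level[w] = level[u] + 1
                queue.append(w)
    levels = [[] for _ in range(max(level.values()) + 2)]
    for v, d in level.items():
        levels[d].append(v)
    return levels


def choose_levels(levels, cost):
    """Return (l, m, h) in the sense used in the proof of Theorem 12.6.2."""
    n = sum(len(r) for r in levels)
    total = sum(cost[v] for r in levels for v in r)
    # m is the first level at which the cumulative cost reaches c(V)/2, so the
    # cost strictly below m and the cost strictly above m are each at most c(V)/2.
    below = 0
    for m, r in enumerate(levels):
        here = sum(cost[v] for v in r)
        if 2 * (below + here) >= total:
            break
        below += here
    # |R_d| <= sqrt(N) is tested as |R_d|^2 <= N to stay in integer arithmetic.
    small = [d for d, r in enumerate(levels) if len(r) ** 2 <= n]
    l = max(d for d in small if d <= m)
    h = min(d for d in small if d >= m)
    return l, m, h


# The "pizza pie" graph of Problem 12.8: a hub (vertex 0) and 12 rim vertices.
rim = list(range(1, 13))
adj = {0: rim[:]}
for i, v in enumerate(rim):
    adj[v] = [0, rim[i - 1], rim[(i + 1) % 12]]
levels = bfs_levels(adj, 0)
l, m, h = choose_levels(levels, {v: 1 for v in adj})
```

With unit costs on this graph the hub alone forms level l = 0, the rim is level m = 1, and the empty level q = 2 serves as h, so M is the whole rim.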

Consider the subgraph of $G$ consisting of the vertices in $M$ and the edges between them. Add a new vertex $v_0$ to replace the vertices in $L \cup R_l$ and add an edge from $v_0$ to each of the vertices at level $l + 1$. This operation retains planarity and the resulting graph remains triangulated because adjacent vertices on $R_{l+1}$ have an edge between them. Also, it defines a spanning tree $T^*$ consisting of $v_0$, the new edges, and the projection of the original spanning tree to the vertices in $M$. $T^*$ has radius at most $\sqrt{N}$.

Apply Lemma 12.6.2 to $T^*$, giving $v_0$ zero cost. This lemma identifies three sets of vertices, $A_0$, $B_0$, and $C_0$, from which we delete $v_0$ and adjacent edges. Since $c(M) \le c(V)$, it follows that there are no edges between vertices in $A_0$ and $B_0$, $c(A_0), c(B_0) \le 2c(V)/3$, and $|C_0| \le 2\sqrt{N}$. Let $C = C_0 \cup R_l \cup R_h$. It follows that $|C| \le 4\sqrt{N}$.

Each of the four sets $A_0$, $B_0$, $L$, and $H$ has cost at most $2c(V)/3$. If any one of them has cost more than $c(V)/3$, let it be $A$ and let $B$ be the union of the remaining sets. If none of them has cost more than $c(V)/3$, order the sets by cost and let $A$ be the union of the fewest of these sets whose cost is at least $c(V)/3$. This procedure ensures that $A$ has cost between $c(V)/3$ and $2c(V)/3$, which implies that $B$ satisfies the same condition as $A$, and the theorem is established. ■
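
The final combining step of the proof can be written as a small greedy procedure. The sketch below is illustrative only: the function name `combine`, the dictionary of set costs, and the sample numbers are assumptions introduced here. It takes the sets in order of decreasing cost and adds them to A until A has cost at least c(V)/3, which covers both cases of the proof, since a single set of cost more than c(V)/3 is selected first and the loop stops at once.

```python
def combine(costs, total):
    """Split the named sets into groups A and B, each of cost at most 2*total/3.

    `costs` maps a set name to its cost. Every cost is assumed to be at most
    2*total/3, as the proof guarantees for A_0, B_0, L, and H.
    """
    order = sorted(costs, key=costs.get, reverse=True)
    group_a, cost_a = [], 0
    for name in order:
        if 3 * cost_a >= total:      # A already has cost at least total/3
            break
        group_a.append(name)
        cost_a += costs[name]
    group_b = [name for name in order if name not in group_a]
    return group_a, group_b


# Hypothetical costs with c(V) = 10 and a zero-cost separator.
costs = {"A0": 3, "B0": 3, "L": 2, "H": 2}
group_a, group_b = combine(costs, total=10)
cost_a = sum(costs[name] for name in group_a)
cost_b = sum(costs[name] for name in group_b)
```

Here no single set exceeds c(V)/3, so the two largest sets are merged into A (cost 6) and the rest form B (cost 4); both stay below 2c(V)/3.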

The preceding version of the planar separator theorem only guarantees that the vertices of a planar graph are divided into two sets whose costs are nearly balanced and a small separator. It does not ensure that the numbers of vertices in the two sets are balanced. The following lemma remedies this situation. We leave its proof to the reader. (See Problem 12.10.)

LEMMA 12.6.3 Let $G = (V, E)$ be an $N$-vertex planar graph having non-negative vertex costs summing to $c(V)$. Then $V$ can be partitioned into three sets, $A$, $B$, and $C$, such that no edge joins vertices in $A$ with those in $B$, neither $A$ nor $B$ has cost exceeding $7c(V)/9$, $|A|, |B| \le 5N/6$, and $C$ contains no more than $K_1\sqrt{N}$ vertices, where $K_1 = 4(\sqrt{2/3} + 1)$.

This  new  result  can  be  applied  to  show  that  the  vertices  of  a  planar  graph  can  be  partitioned 
into  many  sets  each  having  about  the  same  cost  and  such  that  a  small  set  of  vertices  can  be 
removed  to  separate  each  set  from  all  other  sets.  This  result  is  also  left  to  the  reader.  (See 
Problem 12.11.)

LEMMA 12.6.4 Let $G = (V, E)$ be an $N$-vertex planar graph and let $c$ be a non-negative cost function on $V$ with total cost $c(V)$. Let $P \ge 2$. There are constants $2P/3 \le q \le 3P$ and $K_2 = 4(\sqrt{2/3} + 1)/(1 - \sqrt{5/6})$ such that $V$ can be partitioned into $q$ sets, $A_1, A_2, \ldots, A_q$, such that for $1 \le i \le q$

$$c(V)/(3P) \le c(A_i) \le 3c(V)/(2P)$$

and there are sets $C_i$, $|C_i| \le K_2\sqrt{N}$, and $B_i = V - A_i - C_i$ such that no edges join vertices in $A_i$ with vertices in $B_i$.
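
To get a feel for the constants in Lemmas 12.6.3 and 12.6.4, they can be evaluated numerically. The short Python sketch below is illustrative only; the names `K1`, `K2`, and `separator_bound` are chosen here and do not come from the text.

```python
from math import sqrt

K1 = 4 * (sqrt(2 / 3) + 1)       # constant of Lemma 12.6.3, about 7.27
K2 = K1 / (1 - sqrt(5 / 6))      # constant of Lemma 12.6.4, about 83.4


def separator_bound(num_vertices, k=K2):
    """Upper bound k * sqrt(N) on the size of each separator C_i."""
    return k * sqrt(num_vertices)
```

Because $K_2$ is about 83.4, the bound $K_2\sqrt{N}$ falls below $N$ only once $N$ exceeds $K_2^2$, which is roughly 7000; the lemma is a statement about large graphs.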

12.7  The  Performance  of  VLSI  Algorithms 

Using Theorem 12.6.1 and Lemma 12.6.4, we now derive lower bounds on $AT^2$ and $A^2T$ for individual functions by deriving lower bounds on their planar circuit size. In the following section we derive lower bounds on the planar circuit size of multi-output functions using the $w(u, v)$-flow property of these functions. In Section 12.7.2 we set the stage for deriving lower bounds on the planar circuit size of predicates.

12.7.1 The Performance of VLSI Algorithms on Functions

The $w(u, v)$-flow property of functions is introduced in Section 10.4.1 and applied to the study of space-time tradeoffs in the pebble game. In this section we use this property to derive lower bounds on the semellective planar circuit size of multi-output functions.

DEFINITION 12.7.1 A function $f : X^n \mapsto X^m$ has a $w(u, v)$-flow if for all subsets $U_1$ and $V_1$ of its $n$ input and $m$ output variables with $|U_1| \ge u$ and $|V_1| \ge v$ there is some assignment to variables not in $U_1$ (variables in $U_0$) such that the resulting subfunction $h$ of $f$ that maps input variables in $U_1$ to output variables in $V_1$ (the other outputs are discarded) has at least $|X|^{w(u,v)}$ points in the image of its domain. (Note that $w(u, v) \ge 0$.)
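
For very small Boolean functions the definition can be checked by exhaustive search. The following Python sketch is an illustration under stated assumptions, not a method from the text: it takes $X = \{0, 1\}$, the name `flow` is chosen here, and the running time is exponential in $n$. For every admissible pair of subsets it finds the assignment to the fixed inputs that yields the largest image, and it returns the smallest such exponent over all pairs.

```python
from itertools import combinations, product
from math import log2


def flow(f, n, m, u, v):
    """Largest w for which f : {0,1}^n -> {0,1}^m has a w(u, v)-flow (brute force)."""
    worst = float("inf")
    for size_u in range(u, n + 1):
        for u1 in combinations(range(n), size_u):
            u0 = [i for i in range(n) if i not in u1]
            for size_v in range(v, m + 1):
                for v1 in combinations(range(m), size_v):
                    best = 0
                    # Try every assignment to the fixed inputs U_0.
                    for fixed in product((0, 1), repeat=len(u0)):
                        image = set()
                        for free in product((0, 1), repeat=size_u):
                            x = [0] * n
                            for i, bit in zip(u0, fixed):
                                x[i] = bit
                            for i, bit in zip(u1, free):
                                x[i] = bit
                            y = f(tuple(x))
                            image.add(tuple(y[j] for j in v1))
                        best = max(best, len(image))
                    worst = min(worst, log2(best))
    return worst


def identity(x):
    """Example function: the identity on 3-bit tuples."""
    return x
```

For the identity on three bits, choosing two inputs and two outputs that overlap in only one position leaves a single free output bit, so the search reports a flow of 1 for $u = v = 2$.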


A lower bound on planar circuit size of a function $f$ is now derived from its $w(u, v)$-flow property. For some functions the parameter $P$ will need to be large for $w(u, v) > 0$, as is seen in Lemma 12.7.1.


THEOREM 12.7.1 Let $f : X^n \mapsto X^m$ have a $w(u, v)$-flow. Then its semellective planar circuit size must satisfy the following lower bound for $u \ge n(1 - 3/(2P))$, $v \ge m/(3P)$, and $P \ge 2$, where $K_2 = 4(\sqrt{2/3} + 1)/(1 - \sqrt{5/6})$:

$$C_{p,\Omega}(f) \ge \frac{w^2(u, v)}{4K_2^2}$$


Proof Consider a minimal semellective planar circuit for $f : X^n \mapsto X^m$ on $n$ inputs containing $N = C_{p,\Omega}(f)$ inputs, gates, and crossings. We apply the version of the planar separator theorem given in Lemma 12.6.4 to this circuit by assigning unit weight to each input vertex and zero weight to all other vertices. For any integer $P < |V|$ we conclude that the inputs, gates, and crossings of this circuit can be partitioned into $q$ sets $\{A_1, A_2, \ldots, A_q\}$, for $2P/3 \le q \le 3P$, such that each set has at least $n/(3P)$ and at most $3n/(2P)$ input vertices. Since the average number of output vertices in these sets is $m/q$, at least one set, call it $A_1$, has at least the average number of output vertices, that is, at least $m/(3P)$ of them. Let $U_0$ and $V_1$ be the sets of inputs and outputs in $A_1$, respectively. Then, $n/(3P) \le |U_0| \le 3n/(2P)$ and $|V_1| \ge m/(3P)$.

For some assignment of values to variables in $U_0$, there are at least $|X|^{w(u,v)}$ values for the outputs in $V_1$ when $u = n - |U_0| \ge n(1 - 3/(2P))$ and $v = |V_1| \ge m/(3P)$. But all of the values assumed by the outputs in $V_1$ must be assumed by the inputs, gates, and crossing wires of the separator. Since at most two wires cross, a separator $C$ of size $|C|$ has at most $2|C|$ inputs, gates, and wires, each of which can have at most $|X|$ values. Thus, if $C_1$, the separator for $A_1$, has a size satisfying $2|C_1| < w(u, v)$, a contradiction results and the output variables in $V_1$ cannot assume $|X|^{w(u,v)}$ values. It follows that $|C_1| \ge w(u, v)/2$. Since $|C_1| \le K_2\sqrt{N}$, this implies that $N \ge w^2(u, v)/(4K_2^2)$, the desired conclusion. ■


We apply this general result to $(\alpha, n, m, p)$-independent functions and matrix multiplication. A function is $(\alpha, n, m, p)$-independent (see Definition 10.4.2) if it has a $w(u, v)$-flow satisfying $w(u, v) \ge (v/\alpha) - 1$ for $n - u + v \le p$, where $n - u \ge 0$.

LEMMA 12.7.1 Let $f : X^n \mapsto X^m$ be $(\alpha, n, m, p)$-independent. Then for $P \ge (m/3 + 3n/2)/p$ and $m \ge 2\alpha$, $f$ has semellective planar circuit size satisfying the following lower bound:

$$C_{p,\Omega}(f) \ge \frac{m^2}{144(\alpha P)^2 K_2^2}$$

Proof $f$ has a $w(u, v)$-flow satisfying $w(u, v) \ge (v/\alpha) - 1$ for $n - u + v \le p$. When $u \ge n(1 - 3/(2P))$, $n - u + v \le p$ is satisfied if $v \le p - 3n/(2P)$. Since we also require that $v \ge m/(3P)$, this implies that $P \ge (m/3 + 3n/2)/p$. Also, $v/\alpha - 1 \ge v/(2\alpha)$ if $v \ge 2\alpha$. Substituting $m/(3P)$ for $v$, we have the desired conclusion. ■
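
The constant 144 in Lemma 12.7.1 comes from a short substitution, written out here as a check. The chain uses only Theorem 12.7.1 and the choice $v = m/(3P)$ made in the proof, together with the proof's condition $v \ge 2\alpha$.

```latex
\begin{align*}
w(u,v) &\ge \frac{v}{\alpha} - 1 \;\ge\; \frac{v}{2\alpha} \;=\; \frac{m}{6\alpha P}
  && \text{(using } v = m/(3P) \ge 2\alpha \text{)} \\
C_{p,\Omega}(f) &\ge \frac{w^2(u,v)}{4K_2^2}
  \;\ge\; \frac{1}{4K_2^2}\left(\frac{m}{6\alpha P}\right)^{2}
  \;=\; \frac{m^2}{144(\alpha P)^2 K_2^2}
\end{align*}
```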

In Section 10.5 we have shown that many functions are $(\alpha, n, m, p)$-independent. We summarize these results below.

Name                      Function                                                                          Independence Property
Wrapped convolution       $f^{(n)}_{\mathrm{wrapped}} : \mathcal{R}^{2n} \mapsto \mathcal{R}^n$             $(2, 2n, n, n/2)$
Cyclic shift              $f^{(n)}_{\mathrm{cyclic}} : B^{n + \lceil \log n \rceil} \mapsto B^n$            $(2, n + \lceil \log n \rceil, n, n/2)$
Integer multiplication    $f^{(n)}_{\mathrm{mult}} : B^{2n} \mapsto B^{2n}$                                 $(2, 2n, n, n/2)$
$n$-point DFT             $F_n : \mathcal{R}^n \mapsto \mathcal{R}^n$                                       $(2, n, n, n/2)$

It follows that for each case Lemma 12.7.1 holds when $P \le m/(6\alpha)$. Thus, each of these $(\alpha, n, m, p)$-independent functions has a planar circuit size that is quadratic in $n$, its number of inputs. The following theorem results from this observation and Theorem 12.6.1.

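THEOREM 12.7.2 The area $A$ and time $T$ required to compute $f^{(n)}_{\mathrm{wrapped}} : \mathcal{R}^{2n} \mapsto \mathcal{R}^n$, $f^{(n)}_{\mathrm{cyclic}} : B^{n + \lceil \log n \rceil} \mapsto B^n$, $f^{(n)}_{\mathrm{mult}} : B^{2n} \mapsto B^{2n}$, and $F_n : \mathcal{R}^n \mapsto \mathcal{R}^n$ on a semellective VLSI chip satisfy the following bounds:

$$AT^2, A^2T = \Omega(n^2)$$

The $AT^2$ lower bound can be achieved up to a constant multiplicative factor for each of these functions for $\Omega(\log n) \le T \le \sqrt{n}$.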

Proof From Theorem 12.5.1 we know that any fully normal algorithm can achieve the bound $AT^2 = O(n^2)$ for $\Omega(\log n) = T = O(\sqrt{n})$ on an embedded CCC network. Since cyclic shift and FFT are shown to be fully normal (see Section 7.7), we have matching upper and lower bounds for them. From Problem 12.13 we have that the wrapped convolution can be realized with matching bounds on $AT^2$ over the same range of values for $T$. The same statement applies to integer multiplication (see Problem 12.16). ■

In Section 12.6.1 we said that we would exhibit a function whose planar circuit size is nearly quadratic in its standard circuit size. This property holds for the cyclic shifting function because, as shown in Section 2.5.2, $f^{(n)}_{\mathrm{cyclic}} : B^{n + \lceil \log n \rceil} \mapsto B^n$ has circuit size no larger than $O(n \log n)$, whereas from the above its planar circuit size is $\Theta(n^2)$.

The cyclic shift function is also an example of a function for which most of the chip area is occupied by wires when $T = O(\sqrt{n/\log n})$, because in this case the area is $\Omega(n \log n)$ but the number of gates needed to realize it is $O(n \log n)$.

Lower bounds on $AT^2$ and $A^2T$ also exist for matrix multiplication. From Lemma 10.5.3 we know that the matrix multiplication function $f^{(n)}_{A \times B} : \mathcal{R}^{2n^2} \mapsto \mathcal{R}^{n^2}$ has a $w(u, v)$-flow, where $w(u, v) \ge (v - (2n^2 - u)^2/4n^2)/2$. Using this we have the following lower bound on the planar circuit size of this function.

THEOREM 12.7.3 The area $A$ and time $T$ required to compute the matrix multiplication function $f^{(n)}_{A \times B} : \mathcal{R}^{2n^2} \mapsto \mathcal{R}^{n^2}$ with a semellective VLSI algorithm satisfy the following lower bound:

$$AT^2, A^2T = \Omega(n^4)$$

The $AT^2$ lower bound can be met to within a constant multiplicative factor.

Proof Apply Theorem 12.7.1 to matrix multiplication by replacing the number of input variables $n$ by $2n^2$ and the number of output variables $m$ by $n^2$. Taking $u = 2n^2(1 - 3/(2P))$ and $v = n^2/(3P)$, the $w(u, v)$-flow function has value

$$w(u, v) = \bigl(v - (2n^2 - u)^2/4n^2\bigr)/2 \ge \frac{n^2}{6P} - \frac{1}{2}\left(\frac{3n}{2P}\right)^2$$

The right-hand side is maximized when $P = 14$ and has value greater than $n^2/163$, from which the conclusion follows.

As shown in Section 7.5.3, two $n \times n$ matrices can be multiplied with area $A = O(n^2)$ and time $T = n$, which meets the lower bound up to a multiplicative factor. Other near-optimal solutions also exist. (See Problem 12.15.) ■
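
The choice $P = 14$ in the proof can be checked with exact arithmetic. The sketch below is illustrative; the name `flow_coefficient` is chosen here, and it assumes the substitution $u = 2n^2(1 - 3/(2P))$, $v = n^2/(3P)$, under which the lower bound on $w(u, v)$ is $n^2$ times $1/(6P) - 9/(8P^2)$.

```python
from fractions import Fraction


def flow_coefficient(P):
    """Coefficient c(P) in the bound w(u, v) >= c(P) * n^2 for matrix multiplication."""
    return Fraction(1, 6 * P) - Fraction(1, 2) * Fraction(3, 2 * P) ** 2


# Search the integers P >= 2 for the value that maximizes the coefficient.
best_P = max(range(2, 1000), key=flow_coefficient)
best = flow_coefficient(best_P)
```

Over the reals the expression peaks at $P = 27/2$; among integers $P = 14$ is best, giving the coefficient $29/4704$, which lies between $1/163$ and $1/162$.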

12.7.2  The  Performance  of  VLSI  Algorithms  on  Predicates 

The  approach  taken  above  can  be  extended  to  predicates,  functions  whose  range  is  B.  Again 
we  derive  lower  bounds  on  the  size  of  the  smallest  planar  circuit  for  a  function.  However,  since 
the  flow  of  information  from  inputs  to  outputs  is  at  most  one  bit,  we  must  find  some  other 
way  to  measure  the  amount  of  information  that  must  be  exchanged  between  the  two  halves 
of  a  planar  circuit.  An  extension  of  the  communication  complexity  measure  introduced  in 
Section  9.7.1  serves  this  purpose. 

The communication complexity measure of Section 9.7.1 assumes that two players exchange bits to compute the value of a Boolean function $f : B^n \mapsto B$. The input variables of $f$ are partitioned into two sets $U$ and $V$ and assigned to two players. Given this partition, the players choose a protocol (a scheme for alternating the transmission of bits from one to the other) by which to decide the value of $f$ for every input $n$-tuple of $f$. The bits of each $n$-tuple are partitioned between the two players according to the division of the $n$ input variables between the sets $U$ and $V$. The players then use their protocol to determine the value of $f$. The communication complexity $C(U, V)$ of this game is the minimum over protocols of the maximum over input $n$-tuples of the number of bits exchanged by the players to compute $f$ given the partition of the input variables into sets $U$ and $V$. This measure and its associated game are naturally extended to predicates $f : X^n \mapsto B$, whose variables assume values over the set $X$. Players now exchange values drawn from the set $X$.

We can derive a lower bound on planar circuit size by applying the planar separator theorem. Since this theorem partitions the input variables into three sets, $A$, $B$, and a separator $C$, where $A$ and $B$ contain at most two-thirds of the total number of input vertices, it is natural to extend the standard communication complexity measure to the following VLSI communication complexity measure for functions $f : X^n \mapsto B$.

DEFINITION 12.7.2 The VLSI communication complexity of a predicate $f : X^n \mapsto B$, $CC_{\mathrm{vlsi}}(f)$, is the minimum of the communication complexity $C(U, V)$ over all partitions $(U, V)$ of the variables of $f$ into two sets of size at most $2n/3$.

The following theorem, which is left as an exercise (see Problem 12.17), summarizes the result of applying the VLSI communication complexity measure $CC_{\mathrm{vlsi}}(f)$ together with the planar separator theorem to derive a lower bound on the semellective planar circuit size of predicates.

THEOREM 12.7.4 Let $f : X^n \mapsto B$ have VLSI communication complexity $CC_{\mathrm{vlsi}}(f)$. Then, the following bounds hold for the computation of $f$ by a semellective VLSI chip with area $A$ in $T$ steps.

$$(CC_{\mathrm{vlsi}}(f))^2 = O(AT^2), O(A^2T)$$

Note that in a planar circuit all the information passed from each side of the separator to the other is sent simultaneously, whereas in the communication game players alternate in sending values drawn from the set $X$. Because more freedom is granted to players in the communication game (each player can choose data to send based on responses previously received from the other player), a lower bound on communication complexity is a lower bound on the amount of information that must be exchanged in a planar circuit.

A number of techniques have been developed to derive lower bounds on the planar circuit size of predicates. One of these uses the pigeonhole principle (also known as a crossing-sequence argument) to derive lower bounds for predicates that are $w(u, v)$-separated. This new property is similar to the $w(u, v)$-flow property of multi-output functions. It is defined below.


DEFINITION 12.7.3 A function $f : X^n \mapsto B$ is $w(u, v)$-separated if its variables can be permuted and partitioned into three sets $U$, $V$, and $Z$, $|U| \ge u$ and $|V| \ge v$, such that there is some value $z$ for variables in $Z$ and values $u_i$ and $v_i$, $1 \le i \le |X|^{w(u,v)}$, for variables in $U$ and $V$, respectively, such that the following holds:

$$f(u_i, v_j, z) = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{otherwise} \end{cases}$$


This  definition  can  be  applied  to  predicates  that  are  associated  with  multi-output  functions. 
These  functions  are  defined  below. 


DEFINITION 12.7.4 The characteristic predicate $\rho_f : X^{n+m} \mapsto B$ of $f : X^n \mapsto X^m$ is defined below.

$$\rho_f(x, y) = \begin{cases} 1 & \text{if } y = f(x) \\ 0 & \text{otherwise} \end{cases}$$
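
A tiny example may help connect the two definitions above. The Python sketch below is illustrative only: the helper `characteristic_predicate` and the 2-bit `increment` function are hypothetical choices made here, and the set $Z$ is taken to be empty. It builds the characteristic predicate of a bijection on 2-bit values and tabulates it on matched pairs $(u_i, v_j)$, producing the identity pattern that Definition 12.7.3 asks for.

```python
from itertools import product


def characteristic_predicate(f):
    """Return rho_f with rho_f(x, y) = 1 if y == f(x) and 0 otherwise."""
    return lambda x, y: 1 if y == f(x) else 0


def increment(x):
    """Hypothetical example: add 1 modulo 4 to a 2-bit number (high bit first)."""
    value = (2 * x[0] + x[1] + 1) % 4
    return (value // 2, value % 2)


rho = characteristic_predicate(increment)
us = list(product((0, 1), repeat=2))   # values u_i for the x variables (the set U)
vs = [increment(u) for u in us]        # matching values v_i for the y variables (the set V)
matrix = [[rho(u, v) for v in vs] for u in us]
```

Because `increment` is one-to-one, `matrix` is the 4 x 4 identity, so this predicate is separated with $|X|^{w} = 4$, that is, $w = 2$, for $|U| = |V| = 2$. The general statement for functions with a $w(u, v)$-flow is the subject of Problem 12.18 and is not established by this example.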


It is straightforward to show that the characteristic predicate of a function that has a $w(u, v)$-flow is $w(u, v)$-separated. (See Problem 12.18.) As a consequence, quadratic lower bounds exist on the semellective planar circuit size of the characteristic predicates of the convolution, cyclic shift, integer multiplication, discrete Fourier transform, matrix multiplication functions, and many others.


12.8  Area  Bounds 

We  now  derive  lower  bounds  on  the  area  used  by  semellective  VLSI  chip  algorithms  for  a 
variety  of  functions.  For  the  functions  considered  here,  these  bounds  are  linear  in  their  number 
of  variables.  As  explained  in  the  Chapter  Notes,  not  all  functions  are  amenable  to  the  type  of 
analysis  presented  in  this  section. 

The technique used to derive area lower bounds is similar to that used in Section 10.4.2 to derive lower bounds on the exchange of space for time in the pebble game. If a chip has many I/O ports, it has large area. On the other hand, if it has a small number of ports, the inputs to the function computed are received over many cycles. If the function has a large $w(u, v)$-flow, by direct analogy with the pebble game, the area must be large to ensure that enough information be stored between cycles.

THEOREM 12.8.1 Let $\beta \ge 1$. If $f : X^n \mapsto X^m$ has a $w(u, v)$-flow, every chip computing $f$ requires area $A = \Omega(\min(m/(2\beta), w(u, v)))$, where $u = n(1 - 1/\beta)$ and $v = m/(4\beta)$.

Proof If the chip has $\pi$ I/O pads or can store $S$ values over the alphabet $X$, it has area $A \ge \lambda^2 \min(\pi, S)$. Fix $\beta \ge 1$. Its value is chosen later to provide a strong lower bound. If $\pi \ge m/(2\beta)$, we are done. Thus, we show that $S \ge w(u, v)$ when $\pi < m/(2\beta)$.

Let the VLSI algorithm have $T$ time steps and let at most $\pi$ outputs be generated on the $i$th time step, $1 \le i \le T$. Create $q$ intervals of consecutive time steps as follows: The first interval contains the first $k_1$ time steps, where $k_1$ is such that the total number of outputs produced during the first $k_1$ steps is as large as possible without exceeding $m/\beta$. Successive intervals are created in the same way, namely by grouping consecutive later time steps to satisfy the same requirement on the number of outputs produced. For all intervals except possibly the last, the number of outputs produced is at least $(m/\beta) - \pi + 1 > m/(2\beta)$. If the last interval contains fewer than $m/(2\beta)$ outputs, redistribute the elements in the last two intervals, of which there are at least $(m/\beta) - \pi + 2 > m/(2\beta) + 2$, so that each has at least $m/(4\beta) + 1$ outputs. It follows that the number of intervals, $q$, satisfies $\beta \le q \le 4\beta$.

We now examine the inputs read during intervals. Since there are $n$ inputs to be read and each is read once, the average number read per interval is $n/q$, which is at most $n/\beta$. It follows that there is some interval $I$ in which at least $m/(4\beta) + 1$ outputs are produced and at most $n/\beta$ inputs are read.

Fix the inputs that are read during $I$. The remaining inputs, of which there are at least $u = n(1 - 1/\beta)$, are free to vary. The number of outputs produced during $I$ is at least $v = m/(4\beta)$. Since $f$ has a $w(u, v)$-flow, if $S < w(u, v)$, the $v$ outputs, whose values are determined by the values stored on the chip at the beginning of $I$, cannot assume all their values. It follows that $S \ge w(u, v)$, which is the desired conclusion. ■
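
The interval construction in the proof is a greedy grouping of time steps, which the following Python sketch illustrates. It is an assumption-laden illustration rather than part of the proof: the name `group_steps` and the sample output schedule are invented here, the code presumes that no single step produces more than $m/\beta$ outputs (true when $\pi < m/(2\beta)$), and it omits the final redistribution of the last two intervals.

```python
def group_steps(outputs_per_step, beta):
    """Greedily group time steps into intervals producing at most m/beta outputs each."""
    m = sum(outputs_per_step)
    cap = m / beta
    intervals, current, produced = [], [], 0
    for step, count in enumerate(outputs_per_step):
        # Close the current interval when adding this step would exceed m/beta.
        if current and produced + count > cap:
            intervals.append(current)
            current, produced = [], 0
        current.append(step)
        produced += count
    if current:
        intervals.append(current)
    return intervals


outputs = [3, 3, 3, 3, 3, 1]   # m = 16 outputs, at most pi = 3 per step
intervals = group_steps(outputs, beta=2)
totals = [sum(outputs[s] for s in group) for group in intervals]
```

With $m = 16$, $\beta = 2$, and $\pi = 3$, every interval but the last produces at least $(m/\beta) - \pi + 1 = 6$ outputs, and the number of intervals, 3, lies between $\beta$ and $4\beta$.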

We now apply this bound to $(\alpha, n, m, p)$-independent functions. Later we apply it to the matrix multiplication function.

THEOREM 12.8.2 Let $f : X^n \mapsto X^m$ be $(\alpha, n, m, p)$-independent. It requires area $A = \Omega\bigl(mp/((n + m/4)\alpha) - 1\bigr)$ when realized by a semellective VLSI algorithm.

Proof We apply Theorem 12.8.1 with $u = n(1 - 1/\beta)$ and $v = m/(4\beta)$. Because $f$ is $(\alpha, n, m, p)$-independent, $w(u, v) \ge v/\alpha - 1$ for $n - u + v \le p$. Since $n - u = n/\beta$ and $v = m/(4\beta)$, this implies that $\beta \ge (n + m/4)/p$. The lower bound of Theorem 12.8.1 then is the smaller of $m/(2\beta)$ and $m/(4\alpha\beta) - 1$. Since we are free to choose $\beta$, we choose it to make the smaller of the two as large as possible. In particular, we set $\beta = (n + m/4)/p$, which provides the desired result. ■
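
The effect of this choice of $\beta$ can be seen by substitution. The lines below are a worked check rather than part of the original proof; the last line instantiates the bound for parameters of the form $(\alpha, n, m, p) = (2, 2k, k, k/2)$, where the symbol $k$ is introduced here only to avoid clashing with $n$.

```latex
\begin{align*}
\beta = \frac{n + m/4}{p}
\quad\Longrightarrow\quad
\frac{m}{4\alpha\beta} - 1 &= \frac{mp}{4\alpha(n + m/4)} - 1,
\qquad
\frac{m}{2\beta} = \frac{mp}{2(n + m/4)} \\
(\alpha, n, m, p) = (2, 2k, k, k/2)
\quad\Longrightarrow\quad
\beta &= \frac{2k + k/4}{k/2} = \frac{9}{2},
\qquad
\frac{m}{4\alpha\beta} - 1 = \frac{k}{36} - 1
\end{align*}
```

For $\alpha = 2$ the first expression is the smaller of the two, so the area bound grows linearly in $k$.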

Because all of the $(\alpha, n, m, p)$-independent functions listed in Theorem 12.7.2 have $n$, $m$, and $p$ proportional to one another, each requires area $A = \Omega(n)$, as stated below. It follows that the lower bound $AT^2 = \Omega(n^2)$ for these problems cannot be achieved to within a constant multiplicative factor if $T$ grows more rapidly with $n$ than $\sqrt{n}$.

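COROLLARY 12.8.1 The functions $f^{(n)}_{\mathrm{wrapped}} : \mathcal{R}^{2n} \mapsto \mathcal{R}^n$, $f^{(n)}_{\mathrm{cyclic}} : B^{n + \lceil \log n \rceil} \mapsto B^n$, $f^{(n)}_{\mathrm{mult}} : B^{2n} \mapsto B^{2n}$, and $F_n : \mathcal{R}^n \mapsto \mathcal{R}^n$ each require area $A = \Omega(n)$ when realized by a semellective VLSI algorithm.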

A  similar  result  applies  to  matrix  multiplication. 

THEOREM 12.8.3 The area $A$ required to compute the matrix multiplication function $f^{(n)}_{A \times B} : \mathcal{R}^{2n^2} \mapsto \mathcal{R}^{n^2}$ with a semellective VLSI algorithm satisfies $A = \Omega(n^2)$.

Proof We apply Theorem 12.8.1 with $n$ and $m$ replaced by $2n^2$ and $n^2$, respectively. Since $u = 2n^2(1 - 1/\beta)$ and $v = n^2/(4\beta)$, the lower bound on the $w(u, v)$-flow for the matrix multiplication function satisfies the following:

$$w(u, v) = \bigl(v - (2n^2 - u)^2/4n^2\bigr)/2 \ge \frac{n^2}{2\beta}\left(\frac{1}{4} - \frac{1}{\beta}\right)$$

The lower bound is a positive multiple of $n^2$ if $\beta > 4$ and largest for $\beta = 8$, from which the desired conclusion follows. ■
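
The claim about $\beta$ can be confirmed with exact arithmetic. In the sketch below, which is illustrative only, the name `area_coefficient` is chosen here; it evaluates the coefficient of $n^2$ in $\min(m/(2\beta), w(u, v))$ from Theorem 12.8.1 with $m = n^2$, using $(1/(2\beta))(1/4 - 1/\beta)$ as the coefficient of the flow bound.

```python
from fractions import Fraction


def area_coefficient(beta):
    """Coefficient of n^2 in min(m/(2*beta), w(u, v)) for matrix multiplication."""
    pads = Fraction(1, 2 * beta)
    storage = Fraction(1, 2 * beta) * (Fraction(1, 4) - Fraction(1, beta))
    return min(pads, storage)


# Search integer values of beta for the one giving the strongest bound.
best_beta = max(range(1, 200), key=area_coefficient)
```

The coefficient is zero at $\beta = 4$, positive beyond it, and largest at $\beta = 8$, where it equals $1/128$; the storage term is always the smaller of the two.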


Problems 

VLSI  COMPUTATIONAL  MODELS 

12.1 Assume the I/O ports are on the periphery of a convex chip. In the speed-of-light model
show  that  if  p  such  ports  all  have  paths  to  some  point  on  the  chip,  then  the  time  for 
data  supplied  to  each  port  to  reach  that  point  is  0(p). 

12.2 Under the assumptions of Problem 12.1, derive a lower bound on the time to compute a function $f$ on $n$ inputs under the additional assumption that there is a path on the chip from the port at which each variable arrives to the port at which $f$ is produced.
Hint: Show that the time required is at least the sum of the number of cycles needed to read all $n$ inputs and the time for data to travel across the chip. State these times in terms of $p$ and choose $p$ to maximize the smaller of these two lower bounds.

CHIP LAYOUT

12.3  Show  that  every  layout  of  a  balanced  binary  tree  on  n  leaves  in  which  the  root  and  the 
leaves  are  placed  on  the  boundary  of  a  convex  region  has  area  proportional  to  n  log  n. 
Hint:  Consider  an  inscribed  quadrilateral  defined  by  the  longest  chord  and  a  chord 
perpendicular  to  it. 

12.4 The $n \times n$ mesh-of-trees network, $n = 2^r$, is described in Problem 7.4. Give an area-efficient layout for an arbitrary graph in this family of graphs and derive an expression for its area.

12.5 Let $n = 2^k$. As suggested in Fig. 12.9, the $n \times n$ tree of meshes $T_n$ is a binary tree in which each vertex is a mesh and the meshes are decreasing in size with distance from the root. The edges between vertices are bundles of parallel wires. The root vertex is an $n \times n$ mesh, its immediate descendants are $n/2 \times n$ meshes, and their immediate descendants are $n/2 \times n/2$ meshes, and so on.

The depth-$d$, $n \times n$ tree of meshes, $T_{n,d}$, is $T_n$ that has been truncated to vertices at distance $d$ or less from the root.

Determine the area of an area-efficient layout of the tree $T_{n,d}$.

COMPUTATIONAL  INEQUALITIES 

12.6 Use the results of Problem 12.11 to extend Theorem 12.7.1 to multilective planar circuits of order $\mu$.

12.7 Further extend the results of Problem 12.6 to $(\beta, \mu)$-multilective VLSI algorithms by showing that, at the expense of a small increase in $AT^2$ and $A^2T$, multiple inputs of a variable at the same I/O port can be treated as a single input, thereby possibly reducing the multilective order of the corresponding planar circuit. This implies that if multiple copies of each variable are read at a single port, then the semellective planar circuit size is a lower bound to both $AT^2$ and $A^2T$.


Figure 12.9 The $4 \times 4$ tree of meshes, $T_4$.

THE PLANAR SEPARATOR THEOREM

12.8 The pizza pie graph $G = (V, E)$ has $n = |V| - 1$ vertices that are uniformly spaced points on a circle as well as a vertex at the center of the circle. $E$ consists of the arcs between vertices on the circle and edges between the central vertex and vertices on the circle.

When $n = 12$, triangulate $G$ by adding edges between vertices on its external face. Illustrate Lemma 12.6.2 by choosing a cost function $c$ and constructing two sets whose costs are at most $2c(V)/3$ and a separator containing at most three vertices.

12.9 In a spanning tree for a graph $G = (V, E)$ the level of a vertex is the length of the path from the root to it. Given a non-negative cost function on the vertices of $G$ totaling $c(V)$, show there is some level $m$ such that the cost of vertices at levels less than and more than $m$ each is at most $c(V)/2$.

12.10 (Two-Cost Planar Separator Theorem) Let $G = (V, E)$ be an $N$-vertex planar graph having non-negative vertex costs summing to $c(V)$. Show that $V$ can be partitioned into three sets, $A$, $B$, and $C$, such that no edge joins vertices in $A$ with those in $B$, neither $A$ nor $B$ has cost exceeding $7c(V)/9$, $A$ and $B$ each contain at most $5N/6$ vertices, and $C$ contains no more than $K_1\sqrt{N}$ vertices, where $K_1 = 4(\sqrt{2/3} + 1)$.
Hint: Apply the planar separator theorem twice. The first time use it to partition $V$ into two sets of about the same size and a separator. If each of the two sets has cost at most $2c(V)/3$, the result holds. If not, make a second application of the planar separator theorem to the set with larger cost. Show that it is possible to combine sets to simultaneously meet both the size and cost requirements.

12.11 Let $G = (V, E)$ be an $N$-vertex planar graph and let $c$ be a non-negative cost function
on $V$ with total cost $c(V)$. Let $P \geq 2$. Show there are constants $2P/3 \leq q \leq 3P$ and
$K_2 = 4(\sqrt{2/3} + 1)/(1 - \sqrt{5/6})$ such that $V$ can be partitioned into $q$ sets, $A_1$, $A_2$,
..., $A_q$ such that for $1 \leq i \leq q$

$c(V)/(3P) \leq c(A_i) \leq 3c(V)/(2P)$

and there are sets $C_i$, $|C_i| \leq K_2\sqrt{N}$, and $B_i = V - A_i - C_i$ such that no edges join
vertices in $A_i$ with vertices in $B_i$.

Hint: When $P = 2$, use the result of Problem 12.10 and combine the vertices of the
separator with the other two sets to satisfy the necessary conditions. When $P > 2$,
subdivide any set with cost exceeding $c(V)/P$ into two sets and a separator using the
two-cost planar separator theorem. Assign vertices of the separator to these two sets to
keep the cost in balance.

THE  PERFORMANCE  OF  VLSI  ALGORITHMS 

12.12 Show that the function defined by the product of three square matrices has a
semellective planar circuit size that is quadratic in its number of variables and that it can be
realized by a VLSI chip with $AT^2$ that meets the semellective planar circuit size lower
bound.

12.13 Show that the wrapped convolution function $f^{(n)}_{\text{wrapped}} : \mathcal{R}^{2n} \mapsto \mathcal{R}^{n}$ can be realized
as an embedded CCC network on a VLSI circuit with area $A$ and time $T$ satisfying
$AT^2 = O(n^2)$ for $\Omega(\log n) \leq T \leq \sqrt{n}$.

12.14 Design a VLSI chip for $n \times n$ matrix multiplication that achieves $AT^2 = n^4 \log^2 n$ for
$T = O(\log n)$.

Hint: Represent each matrix as a $2 \times 2$ matrix of $(n/2) \times (n/2)$ matrices and use the
standard algorithm that performs eight multiplications of $(n/2) \times (n/2)$ matrices. A
multiplier has one side longer than the other. Place the long side of the $(n/2) \times (n/2)$
matrix multiplier at right angles to the long side of the $n \times n$ matrix multiplier. Apply
this rule to the recursive construction of the multiplier.
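The arithmetic behind the hint (eight half-size products combined by four block additions) can be sketched as follows. This is only an illustration of the recursion, under the assumption that $n$ is a power of two and that matrices are lists of rows; the helper names `split`, `add`, and `block_multiply` are hypothetical, and the sketch says nothing about the chip layout or the orientation rule for the multipliers.

```python
def split(M):
    """Split a square matrix (list of rows) into its four quadrants."""
    h = len(M) // 2
    return ([row[:h] for row in M[:h]], [row[h:] for row in M[:h]],
            [row[:h] for row in M[h:]], [row[h:] for row in M[h:]])


def add(X, Y):
    """Entrywise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def block_multiply(A, B):
    """Multiply two n x n matrices (n a power of two) using the standard
    recursion with eight (n/2) x (n/2) products."""
    n = len(A)
    if n == 1:
        return [[A[0][0] * B[0][0]]]
    A11, A12, A21, A22 = split(A)
    B11, B12, B21, B22 = split(B)
    C11 = add(block_multiply(A11, B11), block_multiply(A12, B21))
    C12 = add(block_multiply(A11, B12), block_multiply(A12, B22))
    C21 = add(block_multiply(A21, B11), block_multiply(A22, B21))
    C22 = add(block_multiply(A21, B12), block_multiply(A22, B22))
    # Reassemble the four quadrants into one matrix.
    top = [left + right for left, right in zip(C11, C12)]
    bottom = [left + right for left, right in zip(C21, C22)]
    return top + bottom


print(block_multiply([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

In a layout, each recursive call would correspond to one of the eight half-size multipliers that the hint asks to place at right angles to the enclosing multiplier.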

12.15 Show that an algorithm of the kind described in Problem 12.14 can be combined with
a mesh-based matrix multiplication algorithm of the kind described in Section 7.5.3 to
produce a family of algorithms that achieve the lower bound on $n \times n$ matrix
multiplication for $\Omega(\log n) \leq T \leq n$.

12.16 Devise a VLSI chip for the n-bit integer multiplication function that uses area $A$ and
time $T$ efficiently.

Hint: Let $x$ and $y$ denote binary numbers. Recursively form the product of these
integers as the sum of two products, that of $x$ with the high-order $(n/2)$ bits of $y$ and
that of $x$ with the low-order $(n/2)$ bits of $y$. Use carry-save addition where possible.
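The recursion in the hint can be illustrated in software with the sketch below, which assumes non-negative Python integers and treats $y$ as an $n$-bit number. The names `recursive_multiply` and `carry_save_add` are illustrative only, and the sketch models the arithmetic, not the area or time of a chip.

```python
def recursive_multiply(x, y, n):
    """Multiply x by the n-bit integer y by splitting y into its
    high-order and low-order halves, as in the hint."""
    if n == 1:
        return x if y & 1 else 0
    half = n // 2
    low = y & ((1 << half) - 1)   # low-order half of y
    high = y >> half              # remaining high-order bits of y
    # x*y = (x*high) * 2^half + x*low
    return (recursive_multiply(x, high, n - half) << half) + \
        recursive_multiply(x, low, half)


def carry_save_add(a, b, c):
    """Reduce three addends to two (sum word, carry word) without
    propagating carries; a + b + c == sum_word + carry_word."""
    sum_word = a ^ b ^ c
    carry_word = ((a & b) | (a & c) | (b & c)) << 1
    return sum_word, carry_word


print(recursive_multiply(13, 11, 4))
print(carry_save_add(5, 3, 6))
```

Carry-save addition defers carry propagation, so the partial products of the recursion can be combined with shallow adders and only one carry-propagating addition is needed at the end.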

12.17  Give  a  proof  of  Theorem  12.7.4. 

12.18 Show that the characteristic predicate of a function that has a $w(u, v)$-flow is
$w(u, v)$-separated.

AREA  BOUNDS 

12.19  Show  that  any  VLSI  algorithm  that  realizes  a  superconcentrator  on  n  inputs  requires 
area  0(n). 

Chapter  Notes 

Mead and Conway wrote an influential book [213] that greatly simplified the design rules for
VLSI  chips  and  made  VLSI  design  accessible  to  a  large  audience.  Ullman  [339]  summarized 
the  status  of  the  field  around  1984  and  Lengauer  [193]  addressed  the  VLSI  layout  problem. 
Lengauer  has  also  written  a  survey  paper  [194]  that  provides  an  overview  of  the  theory  of  VLSI 
algorithms  as  of  about  1990.  The  three  transmission  models  described  in  Section  12.2  reflect 
the  analysis  of  Zhou,  Preparata,  and  Khang  [372]. 

Thompson [326] obtained the first important tradeoff results for the VLSI model of
computation. He demonstrated that under a suitable model a lower bound of $AT^2 = \Omega(n^2)$
could be derived for the discrete Fourier transform, a result he subsequently extended to
sorting [327]. Generalizations of this model were made to convex chips [59], compact plane
regions [195], and other closely related models [202]. Vuillemin [355] extended the models
to include pipelining. Chazelle and Monier [67] introduced the transmission-line model
described in Problems 12.1 and 12.2. For a discussion of other models that take into account the
effects of distributed resistance, capacitance and inductance, see [40] and [372].

Systolic algorithms, which make good use of area and time, were popularized by Kung
[177] and others (see, for example, [104,122,179,180,181,190]). The H-tree featured in
Section 12.5.1 is due to Mead and Rem [214]. Prefix computations are discussed in Chapter 2.
The cube-connected cycles network (its layout is given in Section 12.5.3) and the efficient
realization of normal algorithms are due to Preparata and Vuillemin [262], as explained in
Chapter 7. Lengauer [193] provides an in-depth treatment of algorithms for VLSI chip
layout.

Most authors prefer to derive lower bounds on $AT^2$ by partitioning the planar region
occupied by chips [59,195,326]. In effect, they employ a physical version of the planar separator
theorem. The characterization of VLSI lower bounds in terms of planar circuit complexity
introduced by Savage [288] reinforces the connection between memoryless and memory-based
computation explored in Chapter 3 but for planar computations by VLSI chips. It also
provides an opportunity to introduce the elegant planar separator theorem of Lipton and Tarjan
[203]. Lipton and Tarjan [204] developed quadratic lower bounds on the planar circuit size of
shifting and matrix multiplication before the connection was established between VLSI
complexity and planar circuit size. Improving upon results of [288], McColl [209] and McColl and
Paterson [210] show that almost all Boolean functions on $n$ variables require a planar circuit
size of $\Omega(2^n)$ and that this lower bound can be achieved for all functions to within a constant
multiplicative factor close to 1. Turan [336] has shown that the upper bound of Lemma 12.6.1
is tight by exhibiting a family of functions of linear standard circuit size whose planar circuit
size is quadratic.

Abelson  [1]  and  Yao  [366]  studied  communication  complexity  with  fixed  partitions.  Yao 
[367]  and  Lipton  and  Sedgewick  [202]  made  explicit  the  implicit  connection  between  VLSI 
communication complexity and the derivation of the $AT^2$ lower bounds. (See also [236],
[12],  and  [194]  for  a  discussion  of  the  conditions  under  which  lower  bounds  can  be  derived 
on  the  VLSI  communication  complexity  measure.) 

Many authors have contributed to the derivation of semellective lower bounds for
particular functions. Among these are Thompson [326,327,328,329], who obtained bounds of the
form $AT^2 = \Omega(n^2)$ for the DFT and sorting, as did Abelson and Andreae [3] and Brent
and Kung [59] for integer multiplication, JaJa and Kumar [149] for a variety of problems,
Bilardi and Preparata [41] for sorting, Savage for matrix multiplication, inversion, and transitive
closure [289] and binary integer powers and reciprocals [288], and Vuillemin for transitive
functions [355] (see Problem 10.22). These authors generally show that the lower bounds for
functions can be met to within a small multiplicative constant factor.

Good VLSI designs have been given by Baudet, Preparata, and Vuillemin [31] for
convolution, Guibas and Liang [123] for systolic stacks, queues, and counters, and Kung and
Song [183] and Kung, Ruane, and Yen [182] for 2D convolution. Also, Luk and Vuillemin
[207] give an optimal VLSI integer multiplier and Mehlhorn has provided optimal algorithms
for integer division and square rooting [217] whose range of optimality has been extended
by Mehlhorn and Preparata [219]. Preparata [258] has given a mesh-based optimal VLSI
multiplier for large integers and Preparata and Vuillemin have given optimal algorithms for
multiplying square [260] and triangular matrices [261]. C. Savage [284] has given a systolic
algorithm for graph connectivity.

Lower  bounds  for  the  semellective  computation  of  predicates  by  VLSI  algorithms  have 
been  derived  by  Yao  [367]  for  graph  isomorphism,  by  Lipton  and  Sedgewick  [202]  for  the 
recognition of context-free languages, pattern matching, and binary integer factorization
testing, and by Savage [288] for the characteristic predicates of multi-output functions.

Hochschild [134], Kedem and Zorat [163,164], Savage [290,291], and Turan [337] have
developed lower bounds on the performance of multilective VLSI algorithms. Savage has explored
multilective planar circuit size [291], giving a multi-output function with an $\Omega(n^{4/3})$ lower
bound. Turan [337] exhibits a function and a predicate with $\Omega(n^{3/2} \log n)$ and $\Omega(n \log n)$
lower bounds to their multilective planar circuit size, respectively. The $w(u, v)$-flow and
$w(u, v)$-separated properties used in Section 12.7 were introduced in [291].

Lower  bounds  on  the  area  of  chips  have  been  explored  by  a  number  of  authors.  Yao  [367] 
examined  addition;  Baudet  [30]  studied  functions  that  do  not  have  a  large  information  flow; 
Heintz  [131]  derived  bounds  for  matrix-matrix  multiplication;  Leighton  [191]  introduced  and 
used  the  crossing  number  of  a  graph  to  derive  area  bounds;  Siegel  [309]  derived  bounds  for 
sorting;  and  Savage  [288]  examined  functions  with  many  subfunctions.  Bilardi  and  Preparata 
[42]  have  generalized  arguments  of  [30]  and  [152]  to  derive  stronger  area-time  lower  bounds 
for  functions,  such  as  prefix,  for  which  the  information  flow  arguments  give  weak  results. 
Lower  bounds  on  the  area  of  multilective  chips  were  obtained  by  Savage  [291],  Hromkovic 
[142,143],  and  Duris  and  Galil  [93]. 

Models for 3D VLSI chips, which are not yet a reality, have been introduced by Rosenberg
[282,283] and studied by Preparata [263].

TM vs circuit size, as tool to resolve P = NP
equality question 128
transitive closure function 249

composition, 

function  231 
log-space  TM  351 

computability, 

(chapter)  209 

feasible  problems,  serial  computation  thesis 
330 

computation, 

bounded,  impossibility  theorem  for  24 

capacity  532 

circuit, 

equivalence  between  FSM  and  96 
model,  logic  circuits  as  16(*) 
reductions  of  TM  computations  to  128(*) 
cost,  with  HMM  563 
data-dependent,  branching  programs  488 
function,  by  standard  TM  210 
locality  of  reference  558 
multilective  580 
on  a  branching  program  489 
parallel  27(*) 

(chapter)  281 
circuit  models  372 
period  582 
prefix  55(*),  583(*) 
read-once  580 

restricted  models  of,  representing  217(*) 

semellective  580 

serial,  thesis  330 

step,  red-blue  pebble  game  530 

time, 

in  the  VLSI  synchronous  model  579 
pebbling  strategy  531 
computational, 
complexity  23(*) 
brief  history  5 
inequalities  230 
for FSM 95(*)
for  interconnected  FSMs  97 


computational  (cont.) 
inequalities  (cont.) 

for the random-access memory 117(*)
for  the  TM  127  134  1270 
RAM  118 
VLSI chips 587(*)
models  16(*) 

branching  program  comparison 

with  493  O 
parallel  282(*) 

(part  I  -  chapters  2-7)  35 
serial 331(*)
VLSI 579(*)
VLSI, (chapter) 575(*)
time,  TM  relationship  to  circuit  complexity 

5 

work, 

on  FSM  96 
on  PRAM  290 
computer(s), 

balanced  systems  532(*) 
distributed  memory  284  285 
distributed  shared  memory  285 
networked 287(*)
parallel, 

Brent’s principle 291(*)

Flynn’s  taxonomy  285 
memoryless  282(*) 
synchronous  285 

unstructured,  circuit  as  form  of  283 
with memory 283(*)
science  3 

shared  memory  284 
concatenation 

CFL  closed  under  198 
NFSM  164 
string  158 
concurrency, 

See also, PRAM
power  of  314(*) 

conditional vector operations 286
configuration, 

graph  218  334  340 
TM,  k-tape  218 
connection network 289
context-free grammar 22 183
Chomsky normal form 187
context-sensitive grammar 183
context-sensitive language(s) 183(*), 183
Chomsky,

hierarchy component 5
context-sensitive language(s) (cont.)
Chomsky (cont.)

language type 182

machine  type  that  corresponds  to,  (table) 
182 

contradiction, proof by 15(*)
control, 

CPU,  simple  CPU  142(*) 
unit  20 
PDA  177 
standard TM 210
TM  118 
variable  144  474 
controllers 406
convolution, 

Boolean, 

circuit  size  419 

function,  circuit  size  lower  bound  422 
as  Borodin-Cook  lower-bound  method, 
application 505(*)
complexity  of  fast  algorithm  270 
FFT-based  algorithm  269 
and  263(*) 

function  268 

I/O  time  bounds  553 
space-I/O  time  tradeoffs  552(*) 
systolic  arrays  and  28 
theorem 268(*)

wrapped  276  473(*),  474  505 
space-time  lower  bound  505 
Conway, L. 323 613 615
Conway, L. A. 601

Cook, S. A. 72 88 152 323 389 390 497 504
526  527  528  606  607  608 

Cooley, J. W. 279 608
Coppersmith, D. 245 278 608
corollaries, 

area  lower  bounds,  for  independent 
functions,  (12.8.1)  598 
Boolean  convolution  function  circuit  size 
lower  bound,  (9.6.1)  422 
containment  between  time-bounded 

complexity  classes,  (8.5.2)  341 
distinguishable  functions,  space-time  lower 
bound  for,  (10.11.1)  500 
existence  of  languages  not  in  P,  (8.6.1)  343 
FFT  decomposition,  (6.7.1)  267 
FSM,  minimal-state,  characterization  of, 
(4.7.1)  174 


corollaries  (cont.) 

Grigoriev’s  lower-bound  method,  (10.4.1) 
471 

I/O complexity bounds, multi-level, (11.4.1)
539 

languages,  accepted  by  NDTM  accepted  by 
DTM,  (5.2.1)  216 

matrix  multiplication  function,  vis-a-vis 
transitive  closure,  (6.4.1)  250 
nondeterministic  space  classes  closed  under 
complements,  (8.6.2)  346 
Savitch’s  Theorem,  (8.5.1)  340 
separator  theorem  for  trees,  (9.2.1)  397 
space-time  product  lower  bound  for 

independent  functions,  (10.4.1) 

471 

time-bounded  and  space-bounded 

complexity  class  relationships, 
(8.5.2)  341 

Turing  machine  time  lower  bounds,  (3.9.1) 
128 

counter, 

incrementing/decrementing  148 
modulo-p  148 

counting, 

binary  trees  78 
function  75 

CPU  (central  processing  unit)  19  110 

booting  141 

circuit size and depth 146(*)
control,  simple  CPU  142(*) 
simple, 

design 111 137(*)
instructions  140 
micro-instructions  142 
microcode  142 
registers  138 

timing,  simple  CPU  142(*) 

CRCW  (Concurrent  Read/Concurrent 
Write) PRAM 313

computing  Boolean  functions  on  314 
simulation  by  EREW  PRAM  314 

CREW  (Concurrent  Read/Exclusive  Write) 
PRAM 313

circuits, 
and 317(*)

equivalence 376(*), 379
simulation by 377
P-complete problems 380
CREW (Concurrent Read/Exclusive Write)
PRAM  (cont.) 

realization  of  log-space  transformations  on 
380 

crossbar network 289
crossing-sequence argument 596
cryptography, 

brief  history  7 
Csanky, L. 279 609
Csanky’s algorithm 262

fast  matrix  inversion  with  260 
Csirik, J. 620
Culler, D. E. 323 609

cycle(s) 10

cube-connected,  See  CCC 
fetch-and-execute  20 
inside  of  the  590 

cyclic shifting 474(*)

circuit  49 

efficient  branching  programs  for  496(*) 
functions  48  474 
circuits  for  50 

independence  properties  474 
reductions  between  logical  shifting 
functions  51 

space-time  lower  bound  475 
on  the  hypercube  303(*),  304 
reductions,  between  logical  and  50(*) 
Cypher, R. 323 609

D 

DAG (directed acyclic graph) 10

adjacency  matrix  relationship  248 
circuits  238 

convolution  theorem,  FFT  application  269 
logic circuit as a 16
maximal  path  length  249 
space  upper  bounds  483 

DAM (directed acyclic multigraph) 489
data-dependent  computation, 

branching  programs  488 

dataflow  computer, 

circuit  simulation  by  283 
decidable languages 223(*), 225
standard  TM  210 

decimal, 

standard  representation  8 

decision, 

binary  decision  diagram  490 


decision  (cont.) 

branching  program  489 
problems  328 

classification  of  334(*) 
complement  of  329 

language  complements  and  329(*),  330 
regular  languages,  algorithms  171 

tree  489 

multiway  561 
decoder 53(*)
function  53 
circuit  for  54 
definitions, 
basis, 

measure  397 
of  a  circuit  38 
big, 

Oh notation, O( ) 13
Omega notation, Ω( ) 13
Theta notation, Θ( ) 13
bilinear  form  420 
block  I/O  model  557 
Boolean, 

convolution  function  419 
function  class  400  403 
matrix  multiplication  422 
branching  program  489 
space  on  490 
BTM  559 
central  slice  435 
Chomsky  normal  form  187 
circuit  38 
depth  40  394 
depth  with  fan-out  s  394 
family  373 

family,  log-space  uniform  373 
planar  circuit  size  586(*) 
size  40  393 

size  with  fan-out  s  393 
communication, 

complexity,  of  a  communication  game 

437 

complexity,  VLSI  596 
commutative  rings  264 
complete  problems  351 
complexity, 
class  334 

class, complements 343
definitions (cont.)
complexity  (cont.) 

communication,  of  a  communication 

game  437 

communication,  VLSI  596 
computation  on  a  branching  program  489 
configuration, 
graph  218 
k-tape  TM  218 

decision  problems  and  their  languages  328 
depth,  circuit  40 

derivation, phrase-structure grammar 182
DFSM  154 

equivalence classes 176
(φ, λ, μ, ν, τ)-distinguishability 497
DTM  119 

language  in  P  120 
equivalence, 
classes  172 
relation,  DFSM  172 
relation,  for  a  language  172 
relation,  refinement  173 
relation, right-invariant 172
right-invariant,  for  a  language  173 
states  175 

expressions,  regular  158 
final  state,  FSM  92 
formula,  size  394 
FSM, 

computational  work  on  96 
next-state  function  92 
output  alphabet  92 
output  function  92 
functions, 
advice  382 

computed by straight-line programs 38

next-state,  FSM  92 

pairing  382 

partial  recursive  232 

polynomial  advice  382 

primitive recursive 231

proper  330 

reductions  between  46 
symmetric  74 

general  branching  program  490 
goal,  communication  game  437 
grammar, 

context-free  183 
context-sensitive  183 
phrase-structure  182 
regular  184 


definitions  (cont.) 

hard  problems  351 
hierarchical  memory  model  563 
I/O, 

operation  559 
operations,  simple  560 
time  559 

immediate  derivation,  phrase-structure 
grammar 182

implicants  417 

(α, n, m, p)-independent function 469

induction hypothesis 15

initial state, FSM 92

input  alphabet,  FSM  92 

Kronecker  product  503 

language, 

CIRCUIT  SAT  132 
CIRCUIT  VALUE  130 
context-free  183 
context-sensitive  183 
FAN-OUT  TWO  CIRCUIT  SAT  language 
150 
in  NP  120 
in  P  120 

MONOTONE  CIRCUIT  VALUE  150 
P-complete  130 
phrase-structure  182 
recognition,  FSM  92 
regular  158  184 
SATISFIABILITY  132 
matrix, 

multiplication  ring  operations  245 
nice  and  ok  501 
monotone, 

communication  game  441 
function  replacement  rule  418 
multigraph  489 

multiplication,  smallest  circuit  67 
n-indistinguishable  175 
NC  languages  380 
NDTM  120  214 
language  in  NP  120 
NFSM  154 

non-terminals,  phrase-structure  grammar 
182 

notation, 

big Oh, O( ) 13
big Omega, Ω( ) 13
big Theta, Θ( ) 13
P and NP,

complete problems 352
definitions (cont.)

P  and  NP  (cont.) 

problems  335 
P/poly  languages  383 
permutation  74 
planar  circuit  size  586(*) 
programs,  straight-line  38 
proof  by  contradiction  15 
protocol,  communication  game  437 
reducibility  226 
reduction, 

between  functions  46 
via  subfunction  relationship  46 
regular, 

expressions  158 
languages  158 
sets  158 
rings  239(*) 

S-span of a DAG 537
set, 

neighborhood  408 
of  states,  FSM  92 
regular  158 
size,  circuit  40 
slice functions 431

space-bounded  complexity  classes  338 
SPD  matrices  253 

start symbol, phrase-structure grammar 182
straight-line  programs  38 
functions  computed  by  38 
subfunctions  46 
reductions  via  46 
superconcentrator  485 
terminals, phrase-structure grammar 182
time-bounded  complexity  classes  337 
TM, 

canonical  encoding  221 
configuration  218 
standard  210 
transformation  348 

and  complexity  class  relationships  350 
classes  350 
transitive, 

closure, phrase-structure grammar 182
transformations  350 
unique  elements  514 
universal  114 
vertex-disjoint  485 
w(u,v)-flow  469 

degree, 
in  10 


degree  (cont.) 

out  10 

Dekel, E. 323 609
Demetrovics, J. 620
DeMorgan’s  rules, 

Boolean expressions 41

demultiplexers  5  (*) 
dependent variables 399
depth, 

circuit 11 35 40 239 394 436(*)
basis  change  effect  on  396(*) 
bounded  448(*) 

errors  with  b-approximator  of  448 
formula size vs 396(*)
in  a  simple  CPU  146(*) 
monotone  communication  game 
relationship 441

relationship  between  formula  size  and  397 
simple  lower  bounds  on  400 
with  fan-out  s  394 
communication  complexity, 
relationship 438(*)

lower  bound,  for  most  Boolean  functions  79 
monotone, 

clique  function  442(*) 
communication  complexity  relationship 

440(*) 

upper  bound,  for  all  Boolean  functions  80 
derivation 181
immediate  182 
leftmost  186 
parsing  186 
rightmost  186 

descending algorithms 301 306 307
designing, 

circuits  36(*) 

deterministic, 

FSM,  See  DFSM 
PDA  177 

Turing  machine,  See  DTM 

DFSM (deterministic finite-state machine) 98

154 

See  also,  FSM;  NFSM 
equivalence  relation  172 
languages  accepted  by,  same  as  languages 
accepted  by  NFSMs  156 
minimal,  equivalence  relation  177 
NFSM equivalence 156
DFT (discrete Fourier transform) 263 264(*)

as  Borodin-Cook  lower-bound  method, 
application  513(*) 
independence  properties  479 
inverse  265 
space-time, 

lower  bounds  480  513 
product  479(*) 
vector-matrix  product  513 
diagonalization 225

diagram, 

binary  decision  490 
state  18  30 
FSM 21

diameter, 

graph,  network  287 
Diaz, J. 389 606

difference, 

languages  170 
sets  7 

symmetric,  between  sets  234 

diffusion  model, 

VLSI  579 

Ding-Zhu, Du 618
directed graph 10
directed multigraph 489
discrete  Fourier  transform, 

See,  DFT 
disjoint sets 7

(φ, λ, μ, ν, τ)-distinguishability property 497
distinguishability  properties, 

flow  property  relationship  to  500 
matrix  multiplication  509 
matrix-vector product 508
(φ, λ, μ, ν, τ) 497
unique  elements  515 
distributed, 

computing,  brief  history  6 
memory  computer  284  285 
routing  in  309 

shared  memory  computer  285 

distributive laws 41
distributivity, 

Boolean  expressions  42 
divide-and-conquer, 
multiplier,  circuit  for  67 
strategies,  trees  288 
division, 

of  integers  68 
reciprocal  and  68 
domain, 


of a function 11
dominant  terms, 

as rate of growth indicator 13
big Oh notation, O( ) 13
big Omega notation, Ω( ) 13
big Theta notation, Θ( ) 13
doped layer 576
doping 576

DTM ACCEPTANCE language 354

DTM  (deterministic  Turing  machine), 

language  acceptance  333 
language  in  P  120 
multi-tape  333 
P  problems  335 
polynomial-time  330  374(*) 
recursive  language  333 
simulation  of  RAM  332 
standard  118  210(*) 
dual-rail logic 84
Duff, I. S. 613

Dunne, P. E. 89 457 458 609
Duris, P. 603 609
dyadic unate basis 392
dynamic  programming  algorithm  165 

E 

Earley, J. 207 609
Eckert 4

Eckstein, D. M. 323 609

edge(s) 10

Edmonds, J. 330 609
efficiency, 

PRAM  290 
eigenvalues 261
eigenvector 261
electronic lock 148
elementary symmetric functions 74
elimination  method, 
gates, 

for  circuit  size  400(*) 
general  circuits  400 
Gaussian  274 

paths,  monotone  circuits,  lower  bounds 
derivation  413(*) 

embedding, 

1D arrays in 2D meshes 297(*)
arrays  in  hypercubes  299(*) 
graph,  problem  289 

empty,
set, acceptance problem 229
string  181 

empty  (cont.) 

tape  acceptance  problem  228 

emulation 147(*)
encoder 51(*)

circuit  52 

function,  circuit  for  52 

encoding, 

canonical,  of  TM  221 
string,  TM  and  222(*) 
unary  383 

end-of-tape marker 210
endpoint, 

set  426 
size  426 

ENIAC 4

enumeration tape 215
equivalence, 

class  10  172 

DFSM  and  NFSM  156(*) 
regular  expressions  159 
relations  10  172 
DFSM  172 
on languages 171(*)
on states 171(*)
refinement  173 
right-invariant  172 
right-invariant  173 
states  175 

ERCW  (Exclusive  Read/Concurrent  Write) 
PRAM 313

EREW  (Exclusive  Read/Exclusive  Write) 
PRAM simulation 313

by  hypercube  network  317 
CRCW  PRAM  314 

of  normal  algorithm  313 
error function 451
Evey, J. 207 609
EXACT COVER language 360
exclusive  access, 

PRAM  312 

existential quantification 365
expansion, 

series,  Taylor  73 
sum-of-products  44 

exponential, 

functions  13 


exponential  (cont.) 

size,  bounded-depth  parity  circuits  448  450 
time,  polynomial  time  compared  with  330 
expressions, 

See,  regular  expressions 
EXPTIME class 337

complexity  class  relationships  341 
extreme tradeoffs 466(*), 467

F 

face, 

planar  graph  590 
factorization, 
prime  87 
Schur  254(*) 

Faddeev, D. K. 279 609
Faddeeva, V. N. 279 609
fan-in, 
circuit  38 
of  a  basis  392 
trees  394 
fan-out, 
circuit  38 

size  impact  394(*) 
reduction  150 

construction  used  for  215 
fan-out-1 circuit 392 393

relationship  to  formula  size  394 
fast  Fourier  transform, 

See,  FFT 
feasible 381

problems  335 
Feig, E. 573 606

fetch-and-execute cycle 20 110 138 139(*)
FFT  (fast  Fourier  transform), 

algorithm  266(*),  267  301 
convolution and 263(*)
circuit  266 

convolution  application  269 
decomposition  267 
graph, 

butterfly  238 
decomposition  267  548 
pebbling  463 
pebbling  of  25 

with  column  numberings  302 
I/O  time  bounds, 

in  red-blue  pebble  game  547 
MHG 549 551
FFT (fast Fourier transform) (cont.)

lower  performance  bounds  565 
S-span  546 

space-I/O  time  tradeoffs  546(*) 
straight-line  program  for  238 
Fich, F. 528 607
field 274

FIFO  (first-in,  first-out), 

LRU  analysis  relative  to  568 
page-replacement  algorithm  568 

final state 92 154

fine-grained parallel computers 283 284
finite, 

functions  12 
language  9 

finite-state  machine, 

See,  FSM 

first order linear recurrence 86
Fischer, C. N. 609

Fischer, M. J. 152 456 528 607 609 613 616

flip-flop  109 
floor  function  13 

FLOP  (floating  point  operations  per 
second) 282
flow  properties, 

distinguishability  property  relationship  to 
500 

functions  469(*) 
matrix  multiplication  477 

Flynn, M. J. 323 609
Flynn’s  taxonomy, 

parallel  computers  285 

form, 

bilinear  420 

semi-disjoint  420 

semi-disjoint,  replacement  rule  420 

formal, 

computational  models  4 
languages  4  21 
brief  history  5 

Chomsky  language  hierarchy  5 

formula, 

fan-out-1 circuit 392 393
size  394 

bounds  on  397 
circuit  depth  vs  396(*) 
fan-out-1 relationship 394
lower  bounds  for  404(*) 
over  two  different  bases  399 
Fortune, S. 323 390 609
Foster, M. J. 601 609


Fourier, 

See,  DFT;  FFT 

Fraleigh, J. B. 609

Friedman, J. 456 610

FSM (finite-state machine) 92(*)

See  also,  DFSM;  NFSM 
adder  108 

adding  two  binary  numbers  101 
bounded, 

circuits  and  96(*) 
brief  history  5 
(chapter) 153(*)
choice  input  99 
circuit, 

compared  with  94 
computation  equivalence  96 
for  23 

simulation  of  95 
computational, 

inequalities 95(*), 95
model  18(*) 
work  on  96 

computing  exclusive  or  of  its  inputs  93 
decision  problems,  algorithms  171 
deterministic  98 

equivalence of DFSM and NFSM 156(*)
exclusive  or  computation  97 
functions  computed  by  22  94(*),  95 
interconnection 97
language, 

are  regular  185 
association  with  173 
described by regular expressions 164
recognition  by  22 
minimal  algorithm  for  175(*),  176 
models  154(*) 
nondeterministic  98(*),  154 
PDA  control  unit  as  a  177 
pumping  lemma  for  168(*) 

RAM as 111(*)
regular  expressions, 
recognition  by  160(*) 
relationship  with  158 
regular language recognition by 184
ripple adder simulation by 107
simulating with shallow circuits 100(*)
state-diagram 21
synchronous  97 

circuit  simulating  98 
interconnection of 97(*)
FSM (finite-state machine) (cont.)

TM, 

control unit as 118
relationship 217
tape unit as 118
universal  RAM  for  114 
VLSI  chip  design  use  27 
full adder 59

carry-save  adder  realization  by  64 
circuit  18 

full two-input basis 392
fully normal algorithms 301 306

on  2D  arrays  307 

function(s) 10 11

addition  58  60  231 
advice,  polynomial  382 
binary  12  39 

tree  circuits  for  78 
Boolean  12 

algebraic  properties  of  40(*) 
circuit-size  lower  bound  for  most  77 
circuit-size  upper  bound  for  all  82 
class  400  403 
class  Q 23)  401 
complex  77(*) 

computing  on  CRCW  PRAM  314 
depth  lower  bound  for  most  79 
depth  upper  bound  for  all  80 
(k, s)-Lupanov representation in 81
logic  gate  implementation  of  16 
maxterm  of  43 
minterm  of  42 
negations  409(*),  410  411 
sum  of  44 
carry-generate  103 
carry-propagate  103 
carry-terminate  103 
ceiling  13 

characteristic  13  375 
circuits  that  compute  39(*) 
comparator  270 
complete  11119 
composition 231

computation,  by  standard  TM  210 
computed  by, 
circuit  38(*),  392 
DTM  119 

FSM  22  92  94(*),  95 
straight-line  program  38 
TM  230(*) 

domain 11


function(s)  (cont.) 

error  451 
exponential  13 
finite  12 
floor  13 

implication  410 
linear  13 
logarithmic  13 
monotone  85  392  418 
naming  16 
next-set  154 
next-state  18 
DFSM  154 
DTM  119 
NDTM  120 
RAM  120 
standard  TM  210 
output  18 
pairing  382 
partial  11

DTM  119  333 
recursive  231  232  233(*) 
standard  TM  210 

polynomial,  as  real  number  functions  13
predecessor  232 
prefix  55 

parallel,  circuit  for  57 
primitive  recursive  231(*)
projection  231 
proper  subtraction  232 
punctured  threshold  410 
quadratic  14 
range  11

rate  of  growth  13(*)
real  number  use  by  12
realizing  subfunction  of  47 
reductions  between  46(*) 
semi-disjoint  421 
slice  431(*)
space-bounded  342(*)
successor  231
superpolynomial  330 
symmetric  74(*) 
total  210 
transition  212 
truth  table  12  40 
zero  231

Furst,  M.  459  610
G

Gabarro,  J.  389  606

Galil,  Z.  389  457  603  609  610  615

game(s), 

communication, 
complexity  of  437 
monotone  441  442 
geography  369 
I/O  limited  531 

memory-hierarchy  pebble  533(*) 
rules  533 

monotone  communication,  adversarial 
strategy  447 

on  graphs,  PSPACE-complete  problems 
relationship  365 
pebble  24  25 

basic  lower  bounds  method  470 
branching  program  comparison  with  493 
brief  history  6 
lower  bounds  470(*)
playing  463(*) 

red-blue  26  530(*),  532(*),  542
space-time  tradeoff  analysis  with  461
worst-case  tradeoffs  483(*)
universal  vs  existential  369 

gap  theorem  316  337
Garey,  M.  R.  389  610
Gaskov,  S.  B.  89  610
gate(s)  392
circuit  38 
logic  16 

Gaussian  elimination  274

GENERALIZED  GEOGRAPHY  language  370

Gentleman,  A.  M.  323  610

geography  game  369

Gecseg,  F.  620

Gibbons,  A.  323  610

Gilbert,  E.  N.  457  610

Gilbert,  J.  R.  526  610

global  routing  networks  310(*)

Goldmann,  M.  458  610
Goldschlager,  L.  M.  323  389  390  610
grammar  181

context-free  22  183 

Chomsky  normal  form  187 
context-sensitive  183 
phrase-structure  182 
regular  153  184 
graphs, 

bipartite  467 


graphs  (cont.) 

bisection  width,  network  287 
butterfly, 

as  ascending  algorithm  301 
comparator  network  replacement  with  273
FFT  238 
network  289 
circuit  as  37 
configuration  218  334 
diameter,  network  287 
directed, 

adjacency  matrix  relationship  248 
paths  249 
directed  acyclic  10
circuits  238
logic  circuit  as  a  16
embedding  problem  289 
FFT  463 
hypercube  288 
inner  product  541 
mesh  288 
path  in  a  10 
pizza  pie  600 
pyramid  465 
pebbling  466 
trees  288 
undirected  10 
Greenlaw,  R.  389  390  610
grep  command  168(*)

Grigoriev,  D.  Yu.  471  527  610
Grigoriev’s  lower-bound  method  468(*),  470  471  472(*)

Guibas,  L.  J.601  602  610 

H 

H-tree  VLSI  chip  layout  581(*)

matrix-vector  multiplication  582(*)
prefix  computation  on  583(*)

Hagerup,  T.  323  606
Haken,  A.  457  610

HALF-CLIQUE  CENTRAL  SLICE, 

function  435 
language  435 

HALT  (halt  register)  111

simple  CPU  design  spec  138 

halt  state, 

DTM  119 

TM,  nondeterministic  120
halting,

problem  227  228 

halting  (cont.) 

program  113 

HAMILTONIAN  PATH  language  387

Harper,  L.  H.  456  610
Hartmanis,  J.  336  389  610  611
Hastad,  J.  458  459  610
Hatcher,  P.  J.  323  611
Heideman,  M.  T.  279  611
height,  parse  tree  186
Heintz,  C.  A.  603  611
Hennessy,  J.  532  611
Herley,  K.  T.  323  607
Hewitt,  C.  E.  526  573  616

hierarchy, 

Chomsky  5  182
memory,

HMM  562(*)
tradeoffs,  (chapter)  529(*)
space  336(*)
time  336(*)

Hillis,  W.  D.  323  611

history  of  theoretical  computer  science  4(*)
HMM  (hierarchical  memory  model)  562(*)

cost  of  problems  in  565 
lower  bounds  564(*) 
upper  bounds  567(*)

Hochschild,  P.  602  611
Hockney,  R.  W.  322  611
Hodes,  L.  456  611
Hong,  J.-W.  537  573  611
Hong-Kung  lower-bound  method  537(*)
Hoover,  H.  J.  72  88  389  390  455  606  610  611
Hopcroft,  J.  E.  207  236  278  279  389  526  605  611

Horn  clause  385
Hromkovic,  J.  603  611
Huffman,  D.  A.  207  611
hypercube(s)  288

based  machines  298(*) 
broadcasting  on  303(*) 
cycle  shifting  on  303(*)
embedding  arrays  in  299(*)
fast  matrix  multiplication  on  308(*)
normal  algorithms  301(*)

PRAM  simulation  313(*),  315(*)
sorting  algorithm  302 
summing  on  302(*) 


I 

I/O  (input/output)  26

block  557

in  the  MHG  555(*)
bounded  problem  540 
bounds,  matrix-vector  product  539 
capacity  532 
complexity  563 
bounds  539 
brief  history  6 
I/O  time  bounds  535 
limited, 
game  531 

memory  hierarchy  game  533 
models, 

block-transfer  559(*)
RAM-based  559(*) 
operations  26  559 
pebbling  strategy  531 
simple  560 

pads,  VLSI  layout  577 
time  559 
MHG  534 
pebbling  strategy  531 
space  tradeoffs  24(*),  539(*),  541(*),  546(*),  552(*)
time  bounds  536 
for  convolution  553 
for  FFT  547  551 
in  MHG  544  549 
red-blue  pebble  game  537  542 
ideal  PRAM  312
idempotence,  semirings  252
identities, 
matrix  240 

regular  expressions  160 
rings  239 

Immerman,  N.  389  611
Immerman-Szelepcsényi  theorem  344
implicant  417
implication  function  410
impossibility  theorem  24  95  96
in_wrd  110
in-degree  10

incrementing/decrementing  counter  148
independence  properties, 

cyclic  shifting  functions  474 
DFT  479 

matrix  multiplication  470 
wrapped  convolution  473
INDEPENDENT  SET  language  357

indirect  storage  access  function  404  407

induction  15(*)

inequalities, 

computational  23(*),  95 
for  FSM  95(*) 
for  interconnected  FSMs  97 
for  random-access  memory  117(*)
for  TM  127(*)
RAM  118
VLSI  chips  587(*)
Markov’s  515
initial  state  92  214
DFSM  154 
DTM  119 
NDTM  120 
PDA  177 
TM,  standard  210 
initialization, 

red-blue  pebble  game  530 
inner  product  241
graphs,  pebbling  472 
matrix  multiplication  242 
input,
alphabet  92 
choice  99 
NDTM  120 

operation,  red-blue  pebble  game  530 
vertex  10 

INR  (input  register)  111 

simple  CPU  design  spec  138 
insertion  sorting  network  270
instruction, 

assembly  language  112 
direct  memory  140 
indirect  memory  140 
set,  simple  CPU  140(*) 
variable  143 
integer(s), 

addition  function  231
INTEGER  PROGRAMMING  language  362
multiplication  475(*)
algorithm  63

binary  function,  space-time  lower  bound  475

function  232 

function,  space-time  lower  bound  475 
space-time  lower  bound  507 
representation  8  58 

integrated  circuits  575
interconnection,  synchronous  FSM  97(*)
interleaved  random-access  memory  556


interrupt  139 

intersection, 

CFL,  not  closed  under  199 
languages  170 
sets  7 

inversion, 

DFT  265 
matrix  243  252(*) 

algorithm  260 

Borodin-Cook  lower-bound  method
application  511(*)

Csanky’s  algorithm  262 
fast  260(*) 

function,  reduction  from  matrix 
multiplication  to  253 
function,  triangular  matrices  256 
non-singular  243 

reduction  to  SPD  matrix  inversion  254 
space-time  lower  bound  512 
rings  239 

triangular  matrices  255(*)

isomorphism, 

DFSM,  conditions  for  174 
Iverson,  K.  323  611


J 

JaJa,  J.  279  323  527  602  611

Jesshope,  C.  R.  322  611
Johnson,  D.  455  611
Johnson,  D.  H.  279  611
Johnson,  D.  S.  388  389  610  612
Johnson,  R.  B.  603  612
Jones,  N.  D.  389  612

jump  value, 

for  space  483 

Juurlink,  B.  H.  H.  323  612

(k,  s)-Lupanov  representation  80  82

K 

Karatsuba,  A.  88  612
Karchmer,  M.  458  612
Karlin,  A.  R.  323  612

Karp,  R.  M.  88  152  323  388  389  390  609  612

Kasami,  T.  207  612
Kedem,  Z.  M.  602  612
Khachian,  L.  G.  353  389  612
Khang,  S.  M.  601  622
Khasin,  L.  S.  456  612
Kirkpatrick,  D.  G.  528  607
Klawe,  M.  M.  455  528  611  612

Kleene, 

closure  9  158 

CFL  closed  under  198 
NFSM  acceptance  of  163 
star  158 

Kleene,  S.  C.  236  612
Kloss,  M.  456  612
Knuth,  D.  E.  32  279  323  613
Kohavi,  Z.  611
Komlos,  J.  274  456  457  606
Koutsoupias,  E.  456  613
Krapchenko  lower  bound  407(*)
Krapchenko,  V.  M.  88  456  613
Krichevskii,  R.  E.  456  613
Kronecker  product, 
nice  matrices  503 

three-matrix  product  in  terms  of  511
Kumar,  V.  K.  P.  602  611
Kung,  H.  T.  323  537  573  601  602  606  608  609  610  611  612  613  617  618

Kuroda,  S.  Y.  207  613


L 

L1  decision  problem  338
L  decision  problem  338
Laaser,  W.  T.  389  612
Ladner,  R.  E.  152  389  613
Lamagna,  E.  A.  457  613
Landweber,  P.  S.  207  613
language(s)  181

2-SAT  363

3-SAT  356
3-coloring  359
ACCEPTANCE  215

BY  NDTM  AND  DTM  215  216 
DTM  119  333 
LIMITS  223(*) 

NDTM  120  333 
ALTERNATING  QUANTIFIED 
SATISFIABILITY, 

PSPACE-complete  language  369
assembly  112  140(*) 
instructions  112 

associated  with  a  decision  problem  329 
CFL,  Chomsky  normal  form  187 


language(s)  (cont.) 

CIRCUIT  SAT  132  355 
CIRCUIT  SATISFIABILITY  128 
CIRCUIT  VALUE  128  130  131  352 
closed  under  an  operation  170 
complements  170  329(*),  330 
complete  130(*) 
context-free  22  153  183(*) 
closure  properties  198(*) 
parsing  186(*) 

PDA  acceptance  192(*) 
context-sensitive  182  183(*) 
decidable  223(*)

decision  problems  relationship  to  328(*) 
difference  between  170 
DTM  ACCEPTANCE  354 
efficiently  parallelizable  380(*) 
element  distinctness  233 
equivalence  relations  on  171(*)

EXACT  COVER  360 

existence  of  languages  not  in  P  343 

finite  9 

formal  21  181(*)

Chomsky  language  hierarchy  5 
FSM, 

described  by  regular  expressions  164(*) 
GENERALIZED  GEOGRAPHY  370 
HALF-CLIQUE  CENTRAL  SLICE  435 
HAMILTONIAN  PATH  387 
INDEPENDENT  SET  357  358 
infinite  9 

INTEGER  PROGRAMMING  362 
intersection  of  170  171 
LINEAR  INEQUALITIES  353 
machine  140 

MONOTONE  CIRCUIT  VALUE  150  353 
NAESAT  353  356 
NC  380 

NDTM  machine  recognition  215 
non-recursively  enumerable  224 
NP  26  120 

condition  for  P  =  NP  130 
relationship  to  NDTM  26 
NP-complete,
language(s)  (cont.)

NP-complete  (cont.) 

brief  history  5 
reduction  to  132(*) 

NSPACE,  recognition  by  uniform  circuit 
family  375 

P  120 

condition  for  P  =  NP  130 
P-complete, 
brief  history  5 
log-space  reduction  131 
reduction  to  130(*) 

P/poly  382(*),  383 
phrase-structure  182  182(*),  219(*)
are  recursively  enumerable  220
recursively  enumerable  languages  are  219
programming,  brief  history  of  4
properties,  context-free  197(*)

QUANTIFIED  SATISFIABILITY  365  366  367 
recognition  215 
by  FSM  22  92  154 
by  TM  210 
(chapter)  153 
DFSM  154 
NFSM  154 
TM  119 
recursive  210 
DTM  333 

recursively  enumerable  210  223  224(*) 
as  Chomsky  hierarchy  component  5 
but  not  decidable  225(*) 
phrase-structure  relationship  219  220 
reducibility  226(*) 
regular  22  153  158  170(*),  184(*) 
as  Chomsky  hierarchy  component  5 
conditions  for  174(*) 
conditions  for  finite  and  infinite  169 
machine  type  that  corresponds  to,  (table)  182

SATISFIABILITY  132  133  353  356 
strings  and  9(*) 

SUBSET  SUM  361 
TASK  SEQUENCING  361 
undecidable  228  229  230 
unsolvable  223 
verification  of  121
latency  26  284
layout,  VLSI  577(*)

LDLt  factorization  of  SPD  matrices  257(*)
Le  Blanc,  Jr.,  R.  J.  609
Lehman,  P.  L.  601  614


Leighton,  F.  T.  323  603  613  614
Leiserson,  C.  E.  323  601  613
lemmas, 

approximator  circuits, 

on  negative  test  inputs,  (9.6.6)  427 
on  negative  test  inputs,  (9.6.7)  428 
positive  test  inputs,  (9.6.8)  429 
basis  change  effect  on  circuit  size  and  depth, 

(9.2.3)  396 
binary  trees, 

longest  path  length,  (11.9.1)  565 
longest  path  length  for  sorting,  (11.8.1)  560

number  of  unlabeled,  (2.12.2)  78 
Boolean, 

function  negations,  circuits  for,  (9.5.1)  410

matrices,  powers  of,  (6.4.1)  248 
matrix  multiplication  by  monotone 
circuits,  (9.6.3)  423 

matrix  multiplication  on  monotone 

circuits,  (9.6.4)  423 

branching  program, 

pebble-game  lower  bound  from,  (10.9.3)  494

RAM  lower  bound  from,  (10.9.4)  494 
ST  lower  bound  for,  by  reductions, 
(10.11.2)  500 

circuits, 

for  cyclic  shifting,  (2.5.1)  50 
for  demultiplexer  function,  (2.5.6)  55 
for  multiplexer  function,  (2.5.5)  55 
for  next-state/output  functions,  (3.5.1)  120

size  bound  for  indirect  storage  access 
function,  (9.4.1)  405 

size,  relationship  between  planar  and 
standard,  (12.6.1)  586 
class  of  Boolean  functions,  (9.3.1)  401 
clique  function,  positive  test  inputs  for, 

(9.6.5)  425 

clique  lower  bound  technical  lemma, 

(9.7.3)  445 

(9.7.4)  446 

(9.7.5)  446 

communication  complexity  no  more  than 
depth,  (9.7.1)  438 
commutative  rings, 
example,  (6.7.1)  264 
example,  (6.7.2)  264
lemmas  (cont.)

comparator-based  merging  networks, 
disjoint  paths  in,  (10.5.5)  481 
counting,  function,  circuit  for  (2.11.1)  75 
CREW  PRAM  simulation  by  circuits, 

(8.14.1)  377

cyclic  shifting  independence  properties, 

(10.5.2)  474 

decoder  function,  circuit  for,  (2.5.4)  54 
decomposition  of  trees  into  subtrees,  (9.2.4)  397

depth  no  more  than  communication 
complexity,  (9.7.2)  439 

DFT, 

independence  properties,  (10.5.4)  479 
vector-matrix  product  is  1/4-ok,
(10.13.5)  513 

distinguishability  properties, 

flow  property  relationship  to,  (10.11.1)  500

wrapped  convolution,  (10.13.1)  505 
encoder  function,  circuit  for,  (2.5.3)  52
errors  with, 

6-approximator  of,  (9.7.6)  448 
√n-approximator  of  parity,  (9.7.7)  450
fan-out-1  circuits  and  formula  size,
relationship  between,  (9.2.2)  394
FFT  decomposition,  (6.7.4)  267 
functions, 

cyclic  shifting,  circuit  for,  (2.5.1)  50 
cyclic  shifting,  reductions  between  logical 
and  (2.5.2)  51 

realizing  subfunction  of,  (2.4.1)  47 
I/O  time  bounds,  reductions  between, 

(11.3.2)  536 
inverse  DFT,  (6.7.3)  265 
Kronecker  product  of  nice  matrices, 

(10.12.2)  503 

lower  bounds,  indirect  storage  access 
function,  (9.4.2)  407 

matrix, 

multiplication  distinguishability 
properties,  (10.13.3)  509 
multiplication,  flow  properties,  (10.5.3)  477

multiplication,  independence  properties
of,  (10.4.1)  470

multiplication,  S-span  for,  (11.5.1)  541

nice  (10.12.1)  501 

product,  inverting,  (6.2.1)  243 


lemmas  (cont.) 
matrix  (cont.) 

vector  product  distinguishability 
properties,  (10.13.2)  508 
maximal  path  length  in  DAG,  (6.4.2)  249 
monom  removal,  (9.6.1)  418 
normal-form  branching  programs  equivalent 
to  general  ones,  (10.9.2)  492 
pebbling, 

balanced  binary  trees,  (10.2.1)  465 
pyramid  graph,  (10.2.2)  466 
pigeon-hole  principle,  (1.3.2)  16 
planar  circuit,  size  lower  bounds  for 

independent  functions,  (12.7.1)  593

planar  separator  theorem, 
conditional,  (12.6.2)  590 
multi-set,  (12.6.4)  592 
two-cost,  (12.6.3)  592 
PRAM  and  log-space  uniform  circuit 
relationship,  (8.14.2)  378 
proof  by  induction  example,  (1.3.1)  15 
pumping  153 

application  of,  (4.13.2)  198 

CFL,  (4.13.1)  197 

finite  and  infinite  regular  languages, 

(4.5.2)  169 

regular  languages  (4.5.1)  169 
QUANTIFIED  SATISFIABILITY  language, 
log-space  hard,  (8.12.2)  367 
PSPACE-complete,  (8.12.1)  366 
realization  of  log-space  transformations  on 
CREW  PRAM,  (8.14.3)  380 
realizing  subfunction  of  a  function,  (2.4.1)  47

reduction, 

between  logical  and  cyclic  shifting 
functions,  (2.5.2)  51 
from  matrix  multiplication  to  matrix 

inverse,  (6.5.1)  253 
of  matrix  inversion  to  SPD  matrix 
inversion,  (6.5.2)  254 
of  shifting  to  multiplication,  (2.9.1)  68 
of  squaring  to  reciprocal  function, 

(2.10.1)  73
use  of,  (5.8.1)  227 

regular  languages,  conditions  for  finite  and 
infinite,  (4.5.2)  169 

replacement  rule  semi-disjoint  bilinear  form, 

(9.6.2)  420 

rooted  tree  fan-in,  properties  of,  (9.2.1)  394
lemmas  (cont.)

S-span  for  FFT,  (11.5.2)  546
Schur  complement  of  SPD  matrix  is  SPD, 
(6.5.3)  255 

simple  I/O  time  lower  bounds,  (11.3.1)  535 
simulation  of  decision  branching  programs 
by  general  branching  programs, 

(10.9.1)  491 

slice  functions,  representation,  (9.6.9)  431 

squaring  function,  (2.9.2)  68 

states,  equivalence  relation  refinement, 

(4.7.1)  175 

superconcentrator, 

linear-size,  existence  of,  (10.8.1)  485 
technical  lemma  on,  (10.8.2)  486 
technical  lemma  on,  (10.8.3)  486 
three-matrix  product  in  terms  of  Kronecker 
product,  (10.13.4)  511 
tree  circuit,  for  binary  functions,  (2.12.1)  78 
unique  elements, 

distinguishability  properties,  (10.13.7)  515

technical  lemma,  (10.13.6)  514 
unsolvability,  (5.8.1)  227 
wrapped  convolution, 

distinguishability  properties,  (10.13.1)  505

independence  properties,  (10.5.1)  473
Lengauer,  T.  482  526  528  601  602  610  614

length, 

path  10 
strings  9 

Leon,  S.  J.  614
level, 

multigraphs  492 

Leverrier’s  theorem  261
Levin,  L.  A.  88  389  614
Lewis,  H.  R.  236  389  614
Lewis  II,  P.  M.  389  610
lexical  analysis  181
lexicographical  order  222
Li,  Ming  618
Liang,  F.  M.  602  610
linear, 

arrays  292  293(*),  294  304(*)
bounded  automaton  182  204 
combination,  matrix  242 
equation  systems  241  242  262(*) 
equations,  solutions  263 
functions,  as  real  number  functions  13 


linear  (cont.) 

independence,  matrix  243 
recurrence,  first  order,  of  length  n  86 
LINEAR  INEQUALITIES  language, 
inequalities  353 
Lingas,  A.  526  614
Lipton,  R.  J.  601  602  612  614
list, 

adjacency  30 
ranking  problem  321
literals, 

positive  385 

SATISFIABILITY  language  132 
load  balancing  56
local  routing  networks  309(*)
locality  of  reference  558
log-space, 

computations  342 

hard  for  PSPACE,  QUANTIFIED 

SATISFIABILITY  language  367 
P-complete  problems,  justification  for  352 
programs  129 

PSPACE-complete  problems,
ALTERNATING  QUANTIFIED

SATISFIABILITY  language  369
GENERALIZED  GEOGRAPHY  language  370

QUANTIFIED  SATISFIABILITY  language 

367  369 
reduction  131 
TM,  composition  of  351 
transformations, 

on  CREW  PRAM,  realization  of  380 
transitivity  of  350 
uniform, 
circuits  373 
circuits  374 
PRAMs  377 

logarithm  functions  13
logic, 

circuits  392 
(chapter)  35(*) 
computational  model  16(*)
computational  model,  VLSI  579 
dual-rail  84 
gate  16 

mathematical,  as  foundation  for  theoretical 
computer  science  4 
operations  48(*) 

logical  shifting  reduction  50(*),  68
LogP  model  317(*)
loosely  coupled,

computer  network  284
Loui,  M.  C.  526  528  614

lower  triangular, 

matrix  240 

LRU  (least  recently  used)  page-replacement
algorithm  568
Luccio,  F.  618
Luk,  W.  K.  602  614
Lupanov,  O.  B.  89  614
Lynch,  N.  A.  528  607

M 

machine(s), 

language  140 
programs  141 
with  memory, 

See  also  FSM;  PDA;  RAM;  Turing 
machine 
(chapter)  91(*)

main  diagonal, 

matrix  240 

many-to-one  reductions  227
mappings  11

state-to-state  101 

MAR  (memory  address  register)  111 
simple  CPU  design  spec  138 
marker, 

end-of-tape  210 
Markov’s  inequality  515
Maruoka,  A.  424  457  606
masks  576
mathematical, 

logic,  as  foundation  for  theoretical  computer 
science  4 
preliminaries  7(*)
matrix(s)  11  240(*)
addition  function  242
adjacency  11  248
bad  504 
block  243 

Boolean,  powers  of  248 
characteristic  polynomial  260 
circulant  244 
decomposition  246 
good  504 
identity  240 
inversion  252(*) 
algorithm  260 


matrix(s)  (cont.) 
inversion  (cont.) 

Borodin-Cook  lower-bound  method
application  511(*)

Csanky’s  algorithm  262 
fast  260(*) 

function,  reduction  from  matrix 
multiplication  to  253 
function,  triangular  matrices  256 
reduction  to  SPD  matrix  inversion  254 
space-time  lower  bound  512 
linear  combination  243 
lower  triangular  240 
main  diagonal  240 

multiplication  242  244(*),  477(*),  509
application  to  parsing  CFL’s  190 
Boolean  244  422 

Borodin-Cook  lower-bound  method 
application  509(*) 
family  of  inner-product  graphs  541 
fast,  on  a  hypercube  308(*) 
flow  properties  477 
independence  properties  of  470 
on  a  2D  mesh  295(*)
on  a  hypercube  308 
on  linear  arrays  294 
reduction  to  matrix  inversion  253 
reduction  to  transitive  closure  250 
S-span  for  541
size  and  depth  bounds  247
space-time  lower  bound  472  479  511
space-I/O  time  tradeoffs  541(*)
standard  algorithm  422
Strassen’s  algorithm  245(*),  247
three-matrix  product  space-time  lower 
bound  512 

nice  501 

Kronecker  product  503 
non-singular,  inverse  243 
ok  501 

permutation  244  477 
product,  inverting  243 
properties, 
nice  501(*) 
ok  501(*)
rank  243

scalar  product  240
SPD  253(*)

LDLt  factorization  of  257 
reduction  of  matrix  inversion  to  254 
Schur  complement  is  SPD  255
matrix(s)  (cont.)

square  240 

standard  matrix  multiplication  algorithm  242

symmetric  240 
Toeplitz  243 
trace  261 

transitive  closure  248 
transpose  240 

triangular,  inversion  of  255(*) 
upper  triangular  240 
Vandermonde  265 
vector  product  241

Borodin-Cook  lower-bound  method 
application  507(*) 

DFT  513

distinguishability  properties  508 
on  a  linear  array  293(*) 
on  an  H-tree  582(*) 
space-I/O  time  tradeoffs  539(*) 
space-time  lower  bound  508 
zero  240 
Mauchly  4
Maurer,  H.  A.  618
maxterm  43
monotone  441
McColl,  W.  F.  602  614
McCulloch,  W.  S.  207  615
McNaughton,  R.  207  615
MDR  (memory  data  register)  111
simple  CPU  design  spec  138
Mead,  C.  A.  323  601  613  615
Mealy,  G.  H.  152  207  615
Mealy  machine  200
FSM  93

Mehlhorn,  K.  323  457  458  601  602  606  614  615

memory, 

address  111 

MAR,  simple  CPU  design  spec  138 
sequence  568 

bounded,  RAM  19  111  122 
distributed, 
computer  285 
routing  in  309 
shared,  computer  285 
fast,  simulation  in  MHG  558(*)
hierarchical  models  562(*),  563 
pebble  game,  See  MHG 
tradeoffs,  (chapter)  529(*) 
interleaved  random-access  556 


memory  (cont.) 

locality  of  reference  558 
machines  with,  (chapter)  91(*)
management  567

algorithms,  two-level  568(*)
competitive  567(*)

number  of  gates,  RISC  and  CISC  CPUs 
compared  with  138 

organizations,  language  relationship  to  5 
page-replacement  algorithms  567 
FIFO  568 
LRU  568 
MIN  568 

parallel  computers  with  283(*) 
random-access  114(*)
circuit  116 

shared,  computer  284 
unbounded,  RAM  111 
units,  clocked  106 

virtual  memory-management  systems  567 

merging, 

bitonic,  sorting  via  271(*)
block,  algorithm  for  561
efficient  branching  programs  for  496(*)
monotone  circuits  lower  bounds  for  414
networks  270(*),  481(*)

Batcher’s  bitonic  271  273 
comparator-based  481 
space-time  lower  bound  482 
problem  270 
mesh(es)  288
2D  arrays,

embedding  1D  arrays  in  297(*)
fully  normal  algorithms  on  306(*)
matrix  multiplication  on  295(*),  296
normal  algorithm  on  307
simulation  on  1D  array  298
multi-dimensional  292  292(*) 
layouts,  VLSI  chips  583(*) 
row-major  order  293 
toroidal  293 
of  trees  319  599 

message, 
passing  284 
priority  316 
metal  migration  577
Meyer,  A.  R.  389  456  609  619
Meyer  auf  der  Heide,  F.  528  607  615
MHG  (memory-hierarchy  pebble  game)  533(*)
block  I/O  in  555(*)
MHG  (memory-hierarchy  pebble  game)  (cont.)

convolution  bounds  553 

fast  memory  simulation  in  558(*) 

I/O  time  bounds, 
for  FFT  in  551

for  matrix  multiplication  in  544 
on  FFT  graph,  bounds  549 
playing  534 
rules  533 

Micali,  S.  607  610
micro  cycle  139
micro-instructions  139 
simple  CPU  142 

affecting  registers  145 
microcode, 

execute  portion  143 
simple  CPU  142 
Miller,  G.  A.  207  608
Miller,  R.  E.  612

MIMD  (multiple  instruction,  multiple  data)  285

MIN, 

page-replacement  algorithm  568 

minimal, 

DFSM,  equivalence  relation  177 
FSM, 

algorithm  175(*) 
algorithm  for  176 
conditions  for  174 
pebbling  534 
strategy  531 
minimization  232
state  171(*)
problem  158 
minimum, 

feature  size,  VLSI  chip  wires  578 

space,  existence  of  graph  requiring  large  488 

minterm  42

monotone  441 

MISD  (multiple  instruction,  single  data)  286
models, 

branching  program  488(*) 

BSP  317(*)
circuit  372(*),  392(*)

parallel  memoryless  computational  282 
computational  16(*) 

branching  program  comparison  with  493(*)
logic  circuits  as  16(*) 
parallel  282(*) 


models  (cont.) 

computational  (cont.) 

(part  I  -  chapters  2-7)  35 
restricted,  representing  217(*)
serial  331(*)
VLSI,  (chapter)  575(*)

data  parallel  286(*),  286

FSM  154(*)

hierarchical  memory  562(*),  563
I/O,

block-transfer  559(*)
RAM-based  559(*)
LogP  317(*)
machine, 
parallel  29 
sequential  5 
MIMD  285 
MISD  286 
nondeterministic  4 
PRAM  32  311  376(*)

as  canonical  structured  parallel  machine
311(*)
RAM  19(*)
role  and  types  3 
SIMD  285 
SISD  285 

SPMD,  data  parallel  model  implementation 
by  287 

TM,  standard  210(*)
VLSI,

computational  579(*)
diffusion  579
physical  578(*)
synchronous  579
transmission  579
transmission-line  579
modulo-p  counter  148
modulus, 

functions,  as  symmetric  function  74 

Monier,  L.  M.  601  608

monom  417

removing  418 

monotone, 

basis  392 

circuits  27  353  392 
communication  game  441 
rules  442 

depth,  communication  complexity
relationship  440(*)
functions  85  392 

replacement  rules  418
monotone  (cont.)

implicant  417
increasing  392
maxterm  441
minterm  441
prime  implicant  417

MONOTONE  CIRCUIT  VALUE  language  150  353

Moore,  E.  F.  93  152  207  615
Moore  machine,

FSM  93

Muller,  D.  E.  88  89  455  457  615  617

multi-dimensional  meshes  292  292(*)
multi-tape, 

TM  119 

multigraphs  489

level  492

multilective  computation  580
multiplexer  54(*)

function,  circuit  for  55 

multiplier, 

carry-save,  circuit  for  66 
divide-and-conquer,  circuit  for  67 

multiway  decision  tree  561
Munro,  I.  279  607
Myhill,  J.  174  207  615
Myhill-Nerode  theorem  174(*)

N 

n-indistinguishable  states  175 
NAESAT  language  356
naming  function  16
Nassimi,  D.  323  609

natural  numbers  8
NC  languages  380
NDTM  (nondeterministic  Turing
machine)  120(*),  214(*)

See  also  DTM;  Turing  machine  (TM)
language  acceptance  333 
by  both  DTM  and  215  216 
multi-tape  333 

DTM  simulation  of  216 
NP  language  120 
relationship  to  26 

NP  problems  335 
one-tape  333 
recursive  language  333 

Neciporuk,  E.  I.  456  457  615
Neciporuk  lower  bound  405(*)


near-ring  86

negations, 

Boolean  function  409  410 

negative  literal  385
neighborhood, 

set  408 

Nerode,  A.  174  207  615 

networks, 

Benes,  global  routing  network  example  310 
brief  history  6 
CCC  307 

normal  algorithms  on  308 
comparator  270  271 
computer  284  287(*)
connection  289 
crossbar  289 

hypercube,  PRAM  simulation  315(*),  317 
merging  270(*),  481(*)

Batcher’s  bitonic  271  273 
comparator-based  481 
space-time  lower  bound  482 
mesh  of  trees  319 
permutation  310 
routing  309(*) 
sorting  270(*) 

Newton  approximation  algorithm  69
NEXPTIME  class  337

next-set  function, 

NFSM  154 

next-state  function  18  92  214 

DFSM  154 
DTM  119 
NDTM  120 
standard  TM  210

NFSM  (nondeterministic  finite-state  machine)  98(*),  154
acceptance  of  163  164
DFSM  equivalence  156
Kleene  closure  acceptance  by  163
languages  accepted  by,  same  as  languages
accepted  by  DFSMs  156
regular  expression  recognition  by  160  161  162

regular  language  acceptance  185 
nice  matrices  501

Kronecker  product  503
Nishino,  T.  456  606  620
NL  problems  338

2-SAT  language  in  363
complexity  class  relationships  341
no-op  115
Nodine,  M.  H.  573  615
non-ambiguous  languages  187 
non-redundant, 

branching  program  490 

non-singular, 

matrix  243 

non-terminal, 

phrase-structure  grammar  182 
self-embedding  206 
symbols  22 
nondeterministic  98
FSM,  See  NFSM 
models  4 
PDA  177 

Turing  machine,  See  NDTM 
normal  algorithms  301(*)
on  2D  array  307 
ascending  306 
AT2  upper  bound  for  585 
on  CCC  networks  308 
cyclic  shifting  on  the  hypercube  304 
fully  301  306 
normal  form, 

Boolean  function  expansions  42(*) 
branching  program  492 
comparison  of  45(*) 
conjunctive  43 
disjunctive  42 
product-of-sums  44(*) 
ring-sum  45(*)

standard  circuit  construction  methods  40 
sum-of-products  44(*) 

normalization  263

notation, 

big  Oh,  O(  )  13
big  Omega,  Ω(  )  13
big  Theta,  Θ(  )  13
binary  relations  9 

computational  work  done  by  a  FSM  24 
empty, 
set  7 
string  9 

equivalence  classes  10
integer  operations  8 
positive  closure  9 

product,  equivalent  number  of  logic 
operations  employed  24 
register  transfer  142 
set  7  9 

NP  (nondeterministic  polynomial  time), 

complement  of  347(*)


NP  (nondeterministic  polynomial  time) 
(cont.) 

complete, 

problems  that  are  127 
reducibility  used  to  identify  227 
simulation  use  to  show  23 
complete  language  130(*) 
brief  history  5 
complexity  theory  role  128 
reduction  to  132(*) 

distinguishing  P  from,  circuit  complexity  as 
method  for  391 

equal  to  P  question,  as  outstanding 

computer  science  problem  121 
language  120 

condition  for  P  =  NP  130 
relationship  to  NDTM  26 
P  as  subset  of  121 
problems  335 

NP-complete  problems  355(*)

3-COLORING  language  359
3-SAT  language  356

boundary  between  P-complete  problems  and
363(*)

CIRCUIT  SAT  355 

EXACT  COVER  language  360 

HALF-CLIQUE  CENTRAL  SLICE  language  435

INDEPENDENT  SET  language  357  358 
INTEGER  PROGRAMMING  language  362 
justification  for  352 
NAESAT  language  356 
SATISFIABILITY  language  356 
slice  functions  435 
SUBSET  SUM  language  361 
succession  of  reductions  358 
TASK  SEQUENCING  language  361 
NPSPACE, 

complexity  class  relationships  341 
decision  problem  338 
language  375 
NSPACE(r(n))  334

TIME(r(n))  relationship  with  341
NTIME(r(n))  334

SPACE(r(n))  relationship  with  341 
number(s), 
natural  8 
systems  8(*) 
number  system, 

complementary  432  433
O

oblivious, 

data  309 

odd-even, 

transposition  sort  294 

Oettinger,  A.  G.  207  615
offline  algorithm  567
Ofman,  Yu.  88  456  612  616
ok  matrices  501
one-dimensional  meshes  292
online  algorithm  567
OPC  (operation  code  register)  111 
simple  CPU  design  spec  138 
operator, 

associative  48  56  102 
balanced  binary  tree  48 

oracle, 

function  216 

tape  216  333 

oracle  Turing  machine, 

See  OTM

ordering, 

snake  row  297 

OTM  (oracle  Turing  machine)  216  333
out_wrd  110
out-degree  10
output, 

alphabet  92 
function  18  92 

next-state/output  RAM  functions,
circuits  for  120 

operation,  red-blue  pebble  game  530 
vertex  10 

OUTR  (output  register)  111

simple  CPU  design  spec  138 

overflow, 

addition  61

P 

7 

P  =  NP  problem, 

importance  of  336 

outstanding  computer  science  problem  121 
TM  complexity  vs  circuit  size  complexity  as 
tool  for  resolving  128 
P  (polynomial  time),

algorithm,  CFL  recognition  189 
characteristics  5 


P  (polynomial  time)  (cont.) 

complexity  class  relationships  341 
to  each  other  and  to  381 
distinguishing  NP  from,  circuit  complexity 
as  method  for  391 
existence  of  languages  not  in  343 
hard  problems,  LINEAR  INEQUALITIES  353 
log-space  contained  in  342 
problems  130(*),  328(*),  335 
reduction  132 

P-complete  problems  120  130  352(*) 

boundary  between  NP-complete  problems 
and  363(*) 
brief  history  5 
complexity  theory  role  128 
condition  for  P  =  NP  130 
CREW  PRAM  solutions  380 
DTM  ACCEPTANCE  354 
examples  of,  CIRCUIT  VALUE  language  128 
justification  for  352 
log-space  reduction,  131 
MONOTONE  CIRCUIT  VALUE  353 
problems  that  are  127 
reduction  to  130(*) 
subset  of  NP  121 
P/poly  languages  383
page, 

fault  567 

replacement  algorithms  567  568 

pairing  function  382
Pan,  V.  Y.  607

Papadimitriou,  C.  H.  152  236  347  389  390  602  614  616

parallel, 

algorithms,  performance  289(*) 
computation  27(*) 

(chapter)  281 
circuit  models  372 
models  282(*) 
thesis  379(*) 
computers  282  284 
Amdahl’s  law  290(*) 
asynchronous  285 
Brent’s  principle  291(*)

Flynn’s  taxonomy  285 
memoryless  282(*) 
synchronous  285 

unstructured,  circuit  as  form  of  283 
with  memory  283(*) 
data  model  286(*) 

languages,  efficiently  parallelizable  380(*)
parallel  (cont.)

machines, 

P-complete  language  problem  128 
PRAM  6  27  29  311(*)
prefix,  circuits,  efficient  57(*) 

parallelizable  381
parity, 

bounded-depth  parity  circuits, 
exponential  size  448(*) 
exponential  size  of  450 
communication  problem  438 
function  43 
parsing  22  181
CFL  186(*) 

Cocke-Kasami-Younger  algorithm  189 
parse  tree  186

Chomsky  normal  form  197 
parser  186 

partial, 

computations  210  211 
functions  11  210

DTM  computation  333 
NDTM  computation  333 
recursive  232  233(*) 
standard  TM  210
TM  119 
partition, 
balanced  425 

Paterson,  M.  S.  89  455  456  457  458  526  573  602  609  614  616

path(s)  10

directed  graph  10

maximal  path  length  249 
elimination  method,  monotone  circuits, 
lower  bounds  derivation  413(*) 
external  length  451 
length  10 

binary  search  tree  564 
longest, 

binary  tree  565 
for  sorting,  binary  tree  560 
monotone  circuits  414 
rich  499 

undirected  graph  10
vertex-disjoint,  monotone  circuits  415 
Patterson,  D.  323  532  609  611
Paul,  W.  J.  455  456  526  611  616
Paz,  A.  611

PC  (program  counter)  111 
simple  CPU  design  spec  138 
PDA  (pushdown  automata)  20  177(*)


CFL  acceptance  192  192(*) 

(chapter)  153(*)

PDA  (pushdown  automata)  (cont.) 

computational  model  5  20(*) 
languages  accepted  by,  are  context-free  194 
one-way  input  tape  178 
stack  178 

state  diagram  179  180 
TM  relationship  217 

pebble  game  24

basic  lower  bounds  method  470 
branching  program  comparison  with  488(*),  493
brief  history  6 
lower  bounds  470(*) 
memory-hierarchy  533(*) 
pebbling, 

balanced  binary  trees  465 
FFT  graph  463 
inner  product  graphs  472 
minimal  534 
pyramid  graph  466 
strategy  531  558 
playing  463(*) 
red-blue  26 

deletion  of  pebbles  530 
playing  532(*) 
rules  and  strategies  530(*) 
relationship  to  red-blue  pebble  game  530 
rules  and  strategies  462 
space  lower  bounds  470(*),  471 
space-time  tradeoff  analysis  with  461
worst-case  tradeoffs  483(*) 
period, 

computation  582 
VLSI  chip  580 
Perles,  M.  207  606
permutation  74  244

bit  reverse  267 
matrix  244  477 

network  310 

Benes,  global  routing  network  example 
310 

routing  problem  309 
shuffle,  on  linear  arrays  304(*) 
unshuffle,  on  linear  arrays  304(*) 

Peterson,  G.  L.  456  616
phrase-structure  languages  182(*) 
are  recursively  enumerable  220 
machine  type  that  corresponds  to,  (table) 
182
phrase-structure  languages  (cont.)

recursively  enumerable  languages  are  219
TM  and  219(*) 

Pietracaprina,  A.  323  607
pigeon-hole  principle  15(*),  16  168

pipelining  307

Pippenger,  N.  152  390  455  456  457  526  527  528  611  616
Pitts,  E.  207  615
pizza  pie  graph  600
planar  circuit  size  586(*)
lower  bound, 

for  independent  functions  594 
in  terms  of  w(u,  v)-flow  593
relationship  between  AT^2  and  A^2T  and  589

planar  graph, 

face  590 
triangular  590 

planar  separator  theorem  589(*),  591
conditional  590 
multi-set  592 
two-cost  592  600 

Plaxton,  G.  323  609
pointer  doubling  321
polylogarithmic  13  79

polynomial  12

advice  function  382 
characteristic,  of  a  matrix  260 
functions,  as  real  number  functions  12 
language  in  NP  120 
language  in  P  120 
time,  See  P 
pop, 

PDA  177 
state  180 

ports, 

VLSI  layout  577 

POSE  (product-of-sums  expansion)  44(*)
positive, 

closure  9  158 

instance,  monotone  communication  game 
442 

literal  385 
test  inputs  425 

approximator  circuits  429 

possible  accept  state  179 
Post,  E.  L.  236  616

power, 

set  8 

Pracchi,  M.  601  607


PRAM  (parallel  random-access  machine),

as  parallel  machine  27 

as  synchronous  shared  memory  model  285 

brief  history  6 

characteristics  of  29 

circuit  relationship  378 

CRCW  314

CREW  380 

circuit  equivalence  376(*),  379 
circuits  and  317(*)
simulation  by  circuits  377 
efficiency  290 
EREW, 

CRCW  PRAM  simulation  314 
simulation  by  hypercube  network  317 
simulation  of  normal  algorithm  313 
hypercube  network  simulation  of  315(*)
log-space  uniform  377  378 
model  376(*) 

as  canonical  structured  parallel  machine  311(*)

processor-time  tradeoff  290 
simulation,  of  trees,  arrays,  and  hypercubes 
313(*) 
speed  290 

Pratt,  V.  R.  389  457  617

precise, 

TM  334 

predecessor  function  232
predicate  15  232 
prefix, 

circuits,  parallel,  efficient  57(*) 
computation  55(*),  583(*)
segmented  56 
function  55 

parallel,  circuit  for  57 

Preparata,  F.  P.  88  323  455  457  601  602  603  606  607  614  615  617  622

Preston,  Jr.,  K.  613
primality  problem  347

primality  is  in  intersection  of  NP  and  coNP 
348 

test  for  347 

prime, 

factorization  87 
implicant  417 

primitive  recursive  functions  231(*)
priority, 

message  316
PRIORITY  model,

PRAM  313 
problems,  feasible  335 
complete  129  350(*),  351 

decision, 

classification  issues  328(*),  334(*) 
complement  of  329 

language  complements  and  329(*),  330 
regular  languages,  algorithms  171 
hard  350(*),  351 
NL  338 

2-SAT  language  in  363
complexity  class  relationships  341 

NP-complete  352  355(*) 

3-COLORING  language  359
additional  examples  357(*)
boundary  between  P-complete  problems
and  363(*)

EXACT  COVER  language  360 
INDEPENDENT  SET  language  graph  358 
INTEGER  PROGRAMMING  language  362 
justification  for  352 
SATISFIABILITY  language  356 
SUBSET  SUM  language  361 
succession  of  reductions  358 
TASK  SEQUENCING  language  361
P  =  NP  5 

P-complete  352  352(*) 

boundary  between  NP-complete  problems 
and  363(*)

DTM  ACCEPTANCE  354 
justification  for  352 
MONOTONE  CIRCUIT  VALUE  353 
P-hard,  LINEAR  INEQUALITIES  353 
PSPACE-complete  365(*)

ALTERNATING  QUANTIFIED  SATISFIABILITY  language  369
GENERALIZED  GEOGRAPHY  language  370
QUANTIFIED  SATISFIABILITY  language  365  366  367  369
state  minimization  158 
TSP,  NP-complete  association  with  5 
unsolvable  227(*)
product, 

Cartesian  8 
Kronecker  503 
matrix-vector  507(*),  508 
variables  44 

vector-matrix,  DFT  513 


program(s),

boot  141 
branching  488(*) 

comparison  with  other  computational 
models  493(*) 

straight-line  programs  vs.  496(*) 
correctness  5 
halting  113 
machine  language  141 
RAM  112(*)
recursion  375
straight-line  17  35  238(*)
branching  programs  vs.  496(*)
circuits  and  36(*)
tree  491
programming, 

dynamic,  algorithm  165 
projection  function  231
proof, 

by  contradiction  150 
by  induction  150 
methods  of  150 

propagation, 

carry  59 

proper, 

integer  subtraction  function  232 
subset  7 
subtraction  232 

properties, 

algebraic,  of  Boolean  functions  400 
CFL  197(*)
closure  198(*)
non-closure  199 
closure,  regular  languages  170 
distinguishability,

(φ,  λ,  μ,  ν,  τ)  497
flow  property  relationship  to  500 
matrix  multiplication  509 
matrix-vector  product  508 
unique  elements  515 
flow  469 

distinguishability  property  relationship  to 
500 

functions  469(*)
matrix  multiplication  477 
independence, 

cyclic  shifting  functions  474 
DFT  479 

matrix  multiplication  470 
wrapped  convolution  473
properties  (cont.)

matrices,
nice  501(*)
ok  501(*)

regular, 

expressions  159 
languages  170(*) 
rooted  tree  fan-in  394 
of  semirings  251 
sets  8 
trees  397 

protocol, 

communication  game  437 
pseudo-negations  432

realization  by  monotone  circuits  432 
PSPACE  decision  problem  338
PSPACE-complete  problems  365(*)
ALTERNATING  QUANTIFIED  SATISFIABILITY  language  369
GENERALIZED  GEOGRAPHY  language  370 
QUANTIFIED  SATISFIABILITY  language  365 
tree  circuit  366 
Pucci,  G.  323  607
Pudlak,  P.  456  617
pumping  lemma  153
application  of  198 
CFL  197(*) 

FSM  168(*) 

regular  languages,  conditions  for  finite  and 
infinite  169 

punctured  threshold  function  410
pyramid  graph  465

pebbling  466 

Q 

quadratic  function  14 
quantification, 

existential  365 
universal  365 

QUANTIFIED  SATISFIABILITY  language  365  367  369
tree  circuit  366 

quasiplanar  577

query, 

superfluous  499 
Quinn,  M.  J.  323  611  617


R 

Rabin,  M.  O.  152  207  617
radius  of  a  rooted  spanning  tree  589
radix  sort  286

RAM  (random-access  machine), 

architecture  110(*)
as  serial  computational  model  331(*)
based  I/O  models  559(*) 
bounded-memory  111 
branching  program  simulation  495 
circuits,  next-state/output  functions  120 
computational  inequalities  for  117(*),  118
computational  models  19(*) 

FSM  111(*)

memory  hierarchy  simulations,  speed  and 
size  tradeoffs,  (chapter)  529(*) 
programs  112(*),  113 
simulation  122  332 
space  332 
use  495 
time  332 

TM  relationship  to  124 
unbounded-memory  111 
universal  114(*) 

Ramachandran,  V.  323  388  390  612
Ranade,  A.  323  617
Randell,  B.  32  617
random-access  memory  19  114(*)

architectural  components  110 
circuit  116 
design  115(*)
interleaved  556 

range, 

of  a  function  11

rank, 

matrix  243 

RASP  (random-access  stored  program 
machine)  114 
rate  of  growth, 

functions  13(*) 

Raz,  R.  458  617
Razborov,  A.  A.  457  459  617

reachability, 

algorithm, 

paths  explored  by  344 
reachable  vertex  counting  program  345 
problem  338 
read  115

once  computation  580
real  numbers,

functions  using  12 
reciprocal, 
algorithm  70 
division  and  68 
function, 

circuit  for  72 

reduction  of  squaring  to  73 
integer  68 

reduction  from  72(*) 

Reckhow,  R.  A.  323  389  608

recognition, 

language  215 
by  FSM  154 
(chapter)  153 
DFSM  154 
NFSM  154 
TM  119 

regular, 

expressions,  by  FSM  160(*),  161 
languages  184(*) 
languages,  by  FSM  185 
records, 

activation  339 
complete  339 

rectilinearity, 

VLSI  wire  layouts  578 

recurrence, 

first-order  linear,  of  length  n  86 

recursion, 

decomposition,  of  set  of  strings  166 
enumerable  language,  as  Chomsky  hierarchy 
component  5 
language,  DTM  333 
partial  recursive  functions  231  232(*) 

RAM  computability  of  233(*) 
primitive  recursive  functions  23 1  (*) 
standard  TM  210 

recursively  enumerable  languages  223

are  phrase-structure  219 
but  not  decidable  226  228 
Chomsky  hierarchy  component  5 
decidable  225(*)

phrase-structure  languages  are  220 
standard  TM  210 
red  pebble  game, 

See,  pebble  game 
Red’kin,  N.  P.  88  455  457  617

red-blue  pebble  game, 

See  also,  pebble  game 


red-blue  pebble  game  (cont.) 

I/O  time  bounds  for  matrix  multiplication 
in  542 

on  FFT  graph,  computation  and  I/O  time 
lower  bounds  547 
playing  532(*) 
rules  and  strategies  530(*) 

reducibility  226(*)

classifying  languages  as  unsolvable  using  227 
unsolvability  and  226(*) 

reduction  348(*)

between  logical  and  cyclic  shifting  functions  51
CIRCUIT  SAT  language  to  NAESAT  language  357

from  Turing  to  circuit  computations  128(*) 
function  46(*) 

I/O  time  bounds  536 
integer  reciprocal  72(*) 
log-space  131 

logical  and  cyclic  shifting  50(*)
many-to-one  227  348 

multiplication  68(*) 

NP-complete  languages  132(*)

P-complete  languages  130(*) 
polynomial  time  132 
problem-solving  method  35 
of  squaring  to  reciprocal  function,  reduction 
of  squaring  to  73 
subfunction  relationship  46 
to  complete  problems  129 
Turing  348  385 
refinement, 

equivalence  relation  173 
on  states  175 

reflexive, 
relation  10 
register(s)  109 

pebble  game  relationship  to  6 
set  138(*) 
simple  CPU  138 
transfer  notation  142 

regular, 

expressions  158(*) 
equivalence  of  159 
FSM  and  160(*) 

FSM  languages  described  by  164(*) 
NFSM  recognition  of  160  161  162  163 
properties  of  159 
recognition  by  FSM  160(*)
string  search  use  168(*)
regular  (cont.)

grammar  184 
languages  22  158  184(*) 

as  Chomsky  hierarchy  component  5 
as  Chomsky  language  type  182
closure  properties  170 
conditions  for  174(*) 
conditions  for  finite  and  infinite  169 
decision  problems  on,  algorithms  171 
machine  type  that  corresponds  to,  (table) 
182 

properties  of  170(*) 
pumping  lemma  169 
recognition  184(*) 
regular  language  acceptance  185 
machine  recognition  problem  229 
set  158 

Reif,  J.  H.  72  88  323  610  617  618  619

Reischuk,  R.  526  618
reject  state  179
relations  9(*)

equivalence  10  172 
DFSM  172 

for  a  language  172 
on  languages  171(*)
on  states  171(*)
right-invariant  172  173 
reflexive  10 
symmetric  10 
transitive  10 
Rem,  M.  601  615
replacement, 

function  replacement  method,  monotone 
circuit  lower  bounds 
derivation  417(*)
rules  417 

monotone  functions  418 
semi-disjoint  bilinear  form  420 

representation, 

integers  8 

(k,  s)-Lupanov  80  81  82
restricted  models  of  computation  217(*)
standard, 
binary  8 
decimal  8 
reset,  flip-flop  109 
resource, 

bounds  330(*) 

transformations  348 


resource  (cont.) 

vector  534 
rewriting  strings  181

Rice,  H.  G.  236  618
Rice’s  Theorem  229  230
rich  path  499

right-invariant  equivalence  relation  172  173 

rings  239(*)

commutative  264(*) 
linear  arrays  292 
matrix  multiplication  242  245 
near  86 
semirings  251 
Riordan,  J.  89  618
ripple  adder  58  107

RISC  (reduced  instruction  set  computer)  138 
root(s), 

of  unity,  in  commutative  rings  264 
rooted  directed  acyclic  multigraph  489 
vertex  489 

Rosenberg,  A.  L.  603  618
routing  309

networks  309(*),  310(*) 
permutation  problem  309 

row-major  order, 

meshes  293 

RSE  (ring-sum  expansion)  45(*)

Ruane,  L.  M.  602  613
rules, 

absorption,  in  Boolean  expressions  41 
DeMorgan’s,  in  Boolean  expressions  41 
replacement  417 

semi-disjoint  bilinear  form  420 
Ruzzo,  W.  L.  389  390  610

S

S-span,

DAG  537 

matrix  multiplication  541 
safe, 

circuit  107 
Sahay,  A.  323  609
Sahni,  S.  323  609
Santos,  E.  E.  323  609
SATISFIABILITY  language  132  133  328  356
satisfiable  328
Savage,  C.  602  618
Savage,  J.  E.  89  152  389  455  456  457  526  527  528  573  602  603  608  610  611  613  618  619

Savitch,  W.  J.  323  389  618
Savitch’s  theorem  339  340
Saxe,  J.  459  610
CIRCUIT  SAT  language  355
scalar  product, 
matrix  240 
Schafer,  T.  J.  389  618
Schauser,  K.  E.  323  609
Schmidt,  E.  M.  458  615
Schmidt,  H.  A.  611
Schnorr,  C.  P.  152  455  619
Schonhage,  A.  67  88  619
Schonhage-Strassen  circuit  67
Schur,

complement  254
factorization  254(*)

Schürfeld,  U.  619

Schutte,  K.  611

Schutzenberger,  M.  P.  207  619
Scott,  D.  152  207  617 

search, 

binary  565 
tree  564 

Sedgewick,  R.601  602  614 
self-terminating  machine  problem  230
semantics, 

programming  language,  brief  history  5 

semellective  computation  580
semi-disjoint  bilinear  form, 

replacement  rule  420 
semi-disjoint  function, 

circuit  size  lower  bound  421
semigroup  56
semirings  251
separator  theorem, 
for  trees  397 
planar  589(*),  591 
conditional  590 
multi-set  592 
two-cost  592  600 
sequences, 
bitonic  278
sequential, 
circuits  106 

as  concrete  implementation  of  sequential 
machine  model  5 
constructing  from  a  FSM  92 


sequential  (cont.) 
circuits  (cont.) 

designing  106(*) 

machine,  sequential  circuit  as  concrete 
implementation  of  5 

serial, 

computation  thesis  330 
computational  models  331(*)
branching  program  488(*) 
space,  parallel  time  relationship  to  379 

series, 

expansion,  Taylor  73 

set(s)  7

binary  relation  over  9 
cardinality  7 
characteristics  of  7(*) 
difference  7 
disjoint  7 
final  states, 

DFSM  154 
PDA  177 
flip-flop  109 

instruction,  simple  CPU  140(*) 

intersection  7 

matrix  over  240 

membership  notation  7 

neighborhood  408 

power  8 

properties  8 

regular  158 

strings,  concatenation  9 
symmetric  difference  234 
totally  ordered  270 
union  7 

Sethi,  R.  527  605  608

shallow, 

circuits, 

simulating  addition  with  105(*) 
simulating  FSM  with  100(*) 

Shamir,  E.  207  606
Shannon,  C.  E.  88  89  618  619

contributions  to  theoretical  computer 
science  4 

shared  memory  computer  284
Shepherdson,  J.  C.  389  619
shifting, 

circuits,  cyclic  49 
cyclic  474(*) 
function  474 

functions,  independence  properties  474 
functions,  space-time  lower  bound  475
shifting  (cont.)
cyclic  (cont.) 

on  the  hypercube  303(*),  304 
reductions  between  logical  shifting 
and  50(*)
functions  48(*) 
cyclic  48 

cyclic,  circuits  for  50 
logical, 

reduction  to  multiplication  68 
reductions  between  cyclic  shifting 
and  50(*)

Shriver,  E.  A.  M.  573  621
shuffle  permutations  304
on  linear  arrays  304(*)

Siegel,  A.  603  619
signed  two’s  complement  61
SIMD  (single  instruction,  multiple  data)  model  285
simulation  23

branching  program  491
circuit, 

by  dataflow  computers  283 
of  FSM  95 

of  TM  124(*),  134(*)

CPU  by  another  CPU  147(*)
CRCW  PRAM,  by  EREW  PRAM  314 
CREW  PRAM,  by  circuits  377 
FSM,  by  shallow  circuits  102  104 
of  2D  array  on  1D  array  298
of  fast  memory  in  the  MHG  558(*) 
of  normal  algorithm,  PRAM  EREW  313 
PRAM, 

by  hypercube  network  315(*)
of  trees,  arrays,  and  hypercubes  313(*)
by  precise  TM  334 
RAM, 

branching  program  495 
by  DTM  332 
by  TM  122 

TM,  single-tape  simulation  of  multi-tape 
213 

sink  vertex  489

Sipser,  M.  89  456  459  602  607  610  616

SISD  (single  instruction,  single  data)  model  285

size, 

circuit  11  35  239 

as  quantity  whose  rate  of  growth  is 
significant  13 
basis  change  effect  on  396 


size  (cont.) 
circuit  (cont.) 

bounds  on  402 
fan-out  impact  on  394(*)
gate-elimination  method  for  400(*)
in  a  simple  CPU  146(*)
monotone,  clique  function  430 
planar  586(*) 

simple  lower  bounds  on  400 
slice  function  relationship  432 
upper  bounds  on  79(*)
with  fan-out  S  393 

exponential,  bounded-depth  parity  circuits 
450 

formula  394 
bounds  on  397 
circuit  depth  vs  396(*) 
fan-out-1  relationship  394
lower  bounds  for  404(*)
over  two  different  bases  399
monotone  circuits,  slice  functions  434
planar  circuits,  relationship  between  AT^2
and  A^2T  and  589
polynomial,  circuits  of  382 
speed  tradeoffs, 

(chapter)  461(*)
in  memory  hierarchies  529(*)
Skyum,  S.  457  619
Sleator,  D.  D.  573  619

slice  functions, 

central  slice  435 

circuit  size  relationship  432 

HALF-CLIQUE  CENTRAL  SLICE, 

function  435 
language  435 
monotone  circuits  431(*)

NP-complete  435
representation  431
sliding, 

red-blue  pebble  game  462  530 

Smith,  C.  H.  152  619
Smolensky,  R.  459  619
snake  row  ordering  297  316
Snir,  M.  563  573  605

solvable  task  210
solving, 

linear  systems  262(*) 

Song,  S.  W.  602  613

SOPE  (sum-of-products  expansion)  44(*)
sorting, 

algorithm  301  302 




sorting  (cont.) 

binary  85 

functions  as  symmetric  function  74 
monotone  circuits  lower  bounds  413 
bitonic  271  272  278
as  Borodin-Cook  lower-bound  method, 
application  516(*) 

BTM  561 
bubble  sort  294 

comparison-based,  lower  performance 
bounds  565 
linear  arrays  294(*) 

longest  path  length,  for  binary  tree  560 
networks  270(*) 

AKS  274 
fast  274(*) 
insertion  270 

odd-even  transposition  294 
problem  270 
radix  sort  286 

space-time  lower  bounds  516 
stable  sorting  algorithm  304 
via  bitonic  merging  271(*)
space, 

bounded, 

complexity  classes  338(*) 
complexity  classes,  time-bounded  complexity  class  relationships  with  341(*)

functions  342(*) 
branching  program  490 
deterministic,  nondeterministic  time 
contained  in  341
hierarchy  336(*) 

I/O  time  tradeoffs  539(*) 
convolution  552(*) 

FFT  546(*) 

matrix-matrix  multiplication  541(*)
vector-matrix  product  539(*) 
jump  value  for  483 
log-space, 

contained  in  polynomial-time  342 
reduction  131 
lower  bounds  465(*) 
pebble  game  470(*) 

MHG  534 
minimum  465 

existence  of  graph  requiring  large  488 
nondeterministic  space  classes  closed  under 
complements  346 
OTM  217  333


space  (cont.) 

pebbling  strategy  531 

quantity  whose  rate  of  growth  is  significant 
13 

RAM  332  495 

serial,  parallel  time  relationship  to  379 
time, 

and  I/O  tradeoffs  24(*) 
bounds  on  MHG  544 
lower  bound,  cyclic  shifting  functions  475 
lower  bound,  DFT  480  513 
lower  bound,  integer  multiplication  507 
lower  bound,  matrix  inversion  512 
lower  bound,  matrix  multiplication  511 
lower  bound,  matrix-vector  product  508 
lower  bound,  merging  networks  482 
lower  bound,  sorting  516 
lower  bound,  unique  elements  516 
lower  bound,  wrapped  convolution  505 
product  for  branching  programs  500 
product,  matrix  multiplication  472(*) 
product  (ST)  118 
tradeoffs,  (chapter)  461(*)
tradeoffs  in  memory  hierarchies,  (chapter) 
529(*) 

tradeoffs,  matrix  multiplication  479 
tradeoffs,  pebble  game  study  of,  brief 
history  6 

TM  333 

upper  bounds  483(*) 

SPACE(r(n))  334

NTIME(r(n))  relationship  with  341
space  hierarchy  337

spanning  tree  589

BFS  591 

SPD  (symmetric  positive  definite)  matrices  253(*),  253

inversion,  reduction  of  matrix  inversion  to  254

LDL^T  factorization  of  257(*),  257  258  259
Schur  complement  of  SPD  matrix  is  SPD  255

Specker,  E.  456  611
speedup,

PRAM  290
Spira,  P.  M.  455  619
Spirakis,  P.  323  607  610
Spivak,  M.  72  619
SPMD  (single  program  multiple  data)
model, 

data  parallel  model  implementation  by  287 
Sproull,  B.  606  612  613  617  618

square, 
matrix  240 

squaring  function  68
ST  (space-time  product)  118
stable  sorting  algorithm  304
stack, 

alphabet,  PDA  177 
PDA  177 
stacking  state  178 
standard, 

basis  373  392 

of  a  logic  circuit  38 
representation, 
binary  8 
decimal  8 

start, 

symbols  22 
state(s)  30
accept  179 

assignment,  problem  107 
branching  program  495 
diagram  18  30 
FSM  21

equivalence  175
relations  on  171(*)
relations  refinement  on  175 
final,  DFSM  154 
initial, 

DTM  119 
NDTM  120 
minimization  171(*)
problem  158 
n-indistinguishable  175 
next, 

DTM  119 
NDTM  120 

next-state/output  RAM  functions,  circuits 
for  120 

possible  accept  179 
reject  179 
set  of, 

DFSM  154 
DTM  119 
NDTM  120 
stacking  178 
to-state  mappings  101 
Stearns,  R.  E.  336  389  610  611


Steele,  G.  323  606  611  612  613  617  618

step, 

basis  15 

Stewart,  G.  W.  613
Stimson,  M.  J.  323  618
Stockmeyer,  L.  J.  389  390  619
Stone,  H.  S.  323  619

storage, 

access  function  54 
capacity  111 
TM  119 

stored-program  concept  110
straight-line  program(s)  17  35  238(*)

algorithms,  lower  performance  bounds  565 
Boolean,  circuit  as  graph  of  37 
branching  programs  vs.  496(*) 
circuits, 
and  36(*) 

representation  of  37  238 
functions  computed  by  38 
realizing  subfunction  of  a  function  47 
Strassen,  V.  67  88  245  278  618  619

Strassen’s  algorithm  245(*)

matrix  multiplication  247 

strategy, 

adversarial  443  445  447 
pebbling  531  558 
strict  refinement  173 
string(s)  9
acceptance  92 
by  FSM  154 
DFSM  154 
DTM  119 
NFSM  154 

choice  input,  acceptance  by  NDTM  120 
concatenation  158 
empty  9 

encoding  of,  TM  and  222(*) 
languages  and  9(*) 
relation  to  alphabets  9 
searching  for,  with  grep  168(*) 
sets  of,  concatenation  9 
Sturgis,  H.  E.  389  619
Subbotovskaya,  B.  A.  456  619
subfunctions, 

realizing,  of  a  function  47 
relationship,  reduction  via  46 
Subramonian,  R.  323  609
subset(s)  7

SUBSET  SUM  language  361
substitution,

backward  263 

constants,  in  Boolean  expressions  41

subtraction  61(*)

function,  proper  232

successor  function  231
succinct, 

certificate  100 
sum  44

Boolean  function  44 

summing, 

on  the  hypercube  302(*) 
operations  48 

superconcentrator  485  486
superfluous  query  499
superpolynomial  function  330
Swamy,  S.  526  527  618  619
symbol(s), 

non-terminal  22 
start  22 
terminal  22 
symmetric, 

difference,  between  sets  234 
elementary,  functions  as  symmetric  function 
74 

functions  74(*) 
circuits  for  76 
matrix  240 

positive  definite  matrices,  See  SPD 
relation  10 

synchronous, 

FSM  97 

circuit  simulating  98 
model,  VLSI  579 
parallel  computers  285 

systems, 

balanced  computer  532(*) 
number  8(*) 
systolic  array  27  28  292
Szelepscenyi,  R.  389  620
Szemeredi,  E.  274  456  457  606

T 

table  lookup  493

Tanaka,  K.  456  606  620

tape, 

alphabet  214 
DTM  119 
NDTM  120 


tape  (cont.) 

alphabet  (cont.) 

PDA  177 
standard  TM  210 
empty,  acceptance  problem  228 
enumeration  215 
head  20 
multi,  TM  119 
one,  TM  118 
oracle  216  333